Re: Word co-occurrences counts

2004-12-28 Thread Andrew Cunningham
Thanks Doug,
This appears to work like a charm.
Doug Cutting wrote:
> Doug Cutting wrote:
>> You could use a custom Similarity implementation for this query,
>> where tf() is the identity function, idf() returns 1.0, etc., so that
>> the final score is the occurrence count.  You'll need to divide by
>> Similarity.decodeNorm(indexReader.norms("field")[doc]) at the end to
>> get rid of the lengthNorm() and field boost (if any).
>
> Much simpler would be to build a SpanNearQuery, call getSpans(), then
> loop, counting how many times Spans.next() returns true.
>
> Doug
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Word co-occurrences counts

2004-12-23 Thread Erik Hatcher
On Dec 24, 2004, at 12:40 AM, Andrew Cunningham wrote:
> 3) and then:
>    word in document count =
>    hits.score(k)/Similarity.decodeNorm(reader.norms("contents")[k])

You should use hits.id(k), not k, as the index to
reader.norms("contents").
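
To make the difference concrete, here is a minimal self-contained sketch in plain Java. It does not call Lucene at all: the arrays docIdByRank, scoreByRank and normByDocId are hypothetical stand-ins for hits.id(k), hits.score(k) and the decoded norms, and all the numbers are made up for illustration.

```java
// Toy model only: plain arrays stand in for Lucene's Hits and norms.
final class RankVersusDocId {
    // Recover a per-document count for the hit at rank k.
    // The norm array is indexed by document number, so the rank has to be
    // translated to a document number first.
    static float countAtRank(int k, int[] docIdByRank, float[] scoreByRank,
                             float[] normByDocId) {
        int docId = docIdByRank[k];
        return scoreByRank[k] / normByDocId[docId];
    }

    public static void main(String[] args) {
        // Three hypothetical documents; norms are indexed by document number.
        float[] normByDocId = {0.5f, 0.25f, 1.0f};
        // Hits are ordered by score, not by document number.
        int[] docIdByRank = {2, 0, 1};
        // Assumed scores of the form count * norm.
        float[] scoreByRank = {3.0f, 2.0f, 0.5f};
        for (int k = 0; k < docIdByRank.length; k++) {
            float right = countAtRank(k, docIdByRank, scoreByRank, normByDocId);
            float wrong = scoreByRank[k] / normByDocId[k]; // indexes norms by rank
            System.out.println("rank " + k + ": by doc id=" + right
                    + ", by rank=" + wrong);
        }
    }
}
```

Whenever the hit order differs from the document order, indexing the norms by rank divides by the norm of some other document, so the recovered counts come out wrong.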

Erik


Re: Word co-occurrences counts

2004-12-23 Thread Andrew Cunningham
Thanks Doug and all,
I intend to use Lucene to extract a lot of word co-occurrence
statistics from a large corpus in order to perform word disambiguation.
Lucene looks like a great option, but I appear to have hit a snag.
Here's my understanding:

1) Create a Similarity implementation, where:
   tf() returns freq
   sloppyFreq(), idf() and coord() return 1 (because we only need freq
   to score)
2) Perform the query
3) and then:
   word in document count =
   hits.score(k)/Similarity.decodeNorm(reader.norms("contents")[k])
4) A query call such as
   "computer dog"~50
   will return a count of 2 (I assume because the match occurs
   backwards and forwards).
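
The arithmetic behind steps 1 to 3 can be sketched in plain Java as below. This is a toy model under a stated assumption, not Lucene's scoring code: it assumes the score reduces to freq times the stored norm once every other factor returns 1. Lucene's real scoring may include further factors that this sketch ignores. The class and method names are hypothetical.

```java
// Toy model only: this is not Lucene's scoring code.
final class CountFromScore {
    // Assumed score under the custom Similarity from step 1:
    // tf(freq) = freq, every other factor 1, multiplied by the stored norm.
    static float modelScore(int freq, float norm) {
        return freq * norm;
    }

    // Step 3: divide the norm back out to recover the occurrence count.
    // Rounding absorbs small floating-point error.
    static int recoverCount(float score, float norm) {
        return Math.round(score / norm);
    }

    public static void main(String[] args) {
        float norm = 0.125f; // hypothetical decoded norm (lengthNorm * field boost)
        int freq = 7;        // hypothetical number of matches in one document
        float score = modelScore(freq, norm);
        System.out.println("score=" + score
                + " recovered=" + recoverCount(score, norm));
    }
}
```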

My problem occurs when I have the following in a text file:
   computer ...(some words)... dog ...(some words)... computer
and I duplicate the text file several times over. Performing the above
query then returns different phrase counts per document, even though
the copies are identical.

Note: I'm just working with some modified demo code at the moment.
Thanks again,
Andrew
Doug Cutting wrote:
> Andrew Cunningham wrote:
>> "computer dog"~50 looks like what I'm after - now is there some way I
>> can call this and pull out the total number of occurrences, not just
>> the number of document hits? (say, if computer and dog occur near each
>> other several times in the same document).
>
> You could use a custom Similarity implementation for this query, where
> tf() is the identity function, idf() returns 1.0, etc., so that the
> final score is the occurrence count.  You'll need to divide by
> Similarity.decodeNorm(indexReader.norms("field")[doc]) at the end to
> get rid of the lengthNorm() and field boost (if any).
>
> Doug


Re: Word co-occurrences counts

2004-12-23 Thread Doug Cutting
Doug Cutting wrote:
> You could use a custom Similarity implementation for this query, where
> tf() is the identity function, idf() returns 1.0, etc., so that the
> final score is the occurrence count.  You'll need to divide by
> Similarity.decodeNorm(indexReader.norms("field")[doc]) at the end to get
> rid of the lengthNorm() and field boost (if any).

Much simpler would be to build a SpanNearQuery, call getSpans(), then
loop, counting how many times Spans.next() returns true.
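
The shape of that counting loop is sketched below in self-contained Java. ToySpans is a hypothetical stand-in that only mimics a next()/doc() style enumeration over made-up data; it is not Lucene's Spans class, and no SpanNearQuery is actually built here.

```java
import java.util.Map;
import java.util.TreeMap;

// Toy stand-in for a spans enumeration; it is NOT Lucene's Spans class.
final class ToySpans {
    private final int[] matchDocs; // document number of each match, in order
    private int pos = -1;

    ToySpans(int[] matchDocs) {
        this.matchDocs = matchDocs;
    }

    // Advance to the next match; returns false when there are no more.
    boolean next() {
        return ++pos < matchDocs.length;
    }

    // Document number of the current match.
    int doc() {
        return matchDocs[pos];
    }
}

final class SpanCounter {
    // Count matches per document by looping until next() returns false.
    static Map<Integer, Integer> countPerDoc(ToySpans spans) {
        Map<Integer, Integer> counts = new TreeMap<>();
        while (spans.next()) {
            counts.merge(spans.doc(), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Hypothetical data: two matches in doc 0 and three in doc 4.
        ToySpans spans = new ToySpans(new int[] {0, 0, 4, 4, 4});
        Map<Integer, Integer> perDoc = countPerDoc(spans);
        int total = 0;
        for (int c : perDoc.values()) {
            total += c;
        }
        System.out.println("per document: " + perDoc + ", total: " + total);
    }
}
```

Grouping by document while looping gives both the per-document counts and the corpus-wide total in a single pass, with no score or norm arithmetic involved.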

Doug


Re: Word co-occurrences counts

2004-12-23 Thread Doug Cutting
Andrew Cunningham wrote:
> "computer dog"~50 looks like what I'm after - now is there some way I can
> call this and pull out the total number of occurrences, not just the
> number of document hits? (say, if computer and dog occur near each other
> several times in the same document).

You could use a custom Similarity implementation for this query, where
tf() is the identity function, idf() returns 1.0, etc., so that the
final score is the occurrence count.  You'll need to divide by
Similarity.decodeNorm(indexReader.norms("field")[doc]) at the end to get
rid of the lengthNorm() and field boost (if any).

Doug


Re: Word co-occurrences counts

2004-12-23 Thread Andrew Cunningham
"computer dog"~50 looks like what I'm after - now is there some way I can
call this and pull out the total number of occurrences, not just the
number of document hits? (say, if computer and dog occur near each other
several times in the same document).

Paul Elschot wrote:
> On Thursday 23 December 2004 07:50, [EMAIL PROTECTED] wrote:
>> Hi all,
>> I have a curious problem, and initial poking around with Lucene looks
>> like it may only be able to half-handle the problem.
>>
>> The problem requires two abilities:
>> 1.  To be able to return the number of times the word appears in all
>> the documents (which it looks like lucene can do through IndexReader)
>> 2.  To be able to return the number of word co-occurrences within
>> the document set (i.e., how many times does "computer" appear within 50
>> words of "dog")
>>
>> Is the second point possible?
>
> You can use the standard query parser with a query like this:
> "dog computer"~50
> This query is not completely symmetric in the distance computation:
> when computer occurs before dog, the allowed distance is 49, iirc.
>
> There is also a SpanNearQuery for more generalized and flexible
> distance queries, but this is not supported by the query parser,
> so you'll have to construct these queries in your own program code.
>
> In case you have nonstandard retrieval requirements, e.g. you only
> need the number of hits and no further information from the matching
> documents, you may consider using your own HitCollector on the
> lower level search methods.
>
> Regards,
> Paul Elschot


Re: Word co-occurrences counts

2004-12-23 Thread Andrew Cunningham
Ah, so is it possible to return the number of times a term appears?
Daniel Naber wrote:
> On Thursday 23 December 2004 07:50, [EMAIL PROTECTED] wrote:
>> 1.  To be able to return the number of times the word appears in all
>> the documents (which it looks like lucene can do through IndexReader)
>
> If you're referring to docFreq(Term t), that will only return the number
> of documents that contain the term, ignoring how often the term occurs in
> these documents.
>
> Regards
> Daniel



Re: Word co-occurrences counts

2004-12-23 Thread Daniel Naber
On Thursday 23 December 2004 07:50, [EMAIL PROTECTED] wrote:

> 1.  To be able to return the number of times the word appears in all
> the documents (which it looks like lucene can do through IndexReader)

If you're referring to docFreq(Term t), that will only return the number
of documents that contain the term, ignoring how often the term occurs in
these documents.
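
The distinction between the two numbers can be shown with a small self-contained Java sketch. It does not use Lucene: documents are plain token arrays, and documentsContaining and totalOccurrences are hypothetical helper names chosen for this illustration, not IndexReader methods.

```java
// Toy illustration only: documents are plain token arrays, no Lucene involved.
final class TermCounts {
    // Number of documents that contain the term at least once
    // (the kind of number a document frequency gives you).
    static int documentsContaining(String[][] docs, String term) {
        int n = 0;
        for (String[] doc : docs) {
            for (String token : doc) {
                if (token.equals(term)) {
                    n++;
                    break; // count each document once
                }
            }
        }
        return n;
    }

    // Total number of occurrences of the term across all documents.
    static int totalOccurrences(String[][] docs, String term) {
        int n = 0;
        for (String[] doc : docs) {
            for (String token : doc) {
                if (token.equals(term)) {
                    n++;
                }
            }
        }
        return n;
    }

    public static void main(String[] args) {
        String[][] docs = {
            {"dog", "computer", "dog"},
            {"cat"},
            {"dog"}
        };
        System.out.println("documents containing dog: "
                + documentsContaining(docs, "dog"));
        System.out.println("total occurrences of dog: "
                + totalOccurrences(docs, "dog"));
    }
}
```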

Regards
 Daniel

-- 
http://www.danielnaber.de




Re: Word co-occurrences counts

2004-12-23 Thread Erik Hatcher
On Dec 23, 2004, at 1:50 AM, <[EMAIL PROTECTED]> wrote:
> 2.  To be able to return the number of word co-occurrences within
> the document set (i.e., how many times does "computer" appear within 50
> words of "dog")
>
> Is the second point possible?

SpanNearQuery is your friend!  Like Paul said, this is not currently 
supported by QueryParser, however it is easy to do with the API.

Here's an example with a SpanOrQuery (a SpanNearQuery works 
identically) from the Lucene in Action code SpanQueryTest.java.  Two 
documents are indexed:

"the quick brown fox jumps over the lazy dog"
"the quick red fox jumps over the sleepy cat"
This SpanOrQuery is formed (omitting some code details):
SpanOrQuery or = new SpanOrQuery(new SpanQuery[]{quick, fox});
And the spans are displayed:
spanOr([f:quick, f:fox]):
   the <quick> brown fox jumps over the lazy dog (0.37158427)
   the quick brown <fox> jumps over the lazy dog (0.37158427)
   the <quick> red fox jumps over the sleepy cat (0.37158427)
   the quick red <fox> jumps over the sleepy cat (0.37158427)
Erik


Re: Word co-occurrences counts

2004-12-22 Thread Paul Elschot
On Thursday 23 December 2004 07:50, [EMAIL PROTECTED] wrote:
> Hi all,
> 
> I have a curious problem, and initial poking around with Lucene looks
> like it may only be able to half-handle the problem.
> 
>  
> 
> The problem requires two abilities:
> 
> 1.  To be able to return the number of times the word appears in all
> the documents (which it looks like lucene can do through IndexReader)
> 2.  To be able to return the number of word co-occurrences within
> the document set (i.e., how many times does "computer" appear within 50
> words of "dog")
>
>  
> 
> Is the second point possible?

You can use the standard query parser with a query like this:
"dog computer"~50
This query is not completely symmetric in the distance computation:
when computer occurs before dog, the allowed distance is 49, iirc.

There is also a SpanNearQuery for more generalized and flexible
distance queries, but this is not supported by the query parser,
so you'll have to construct these queries in your own program code.
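
For reference, the kind of count being asked about can be sketched in plain Java without Lucene, as below. This is only a brute-force illustration over a token array: it counts every (a, b) position pair within the window, in either order, and assumes a and b are different words. A real span query may count matches differently, and the class and method names here are hypothetical.

```java
// Toy illustration only: brute-force window counting, no Lucene involved.
final class WindowCooccurrence {
    // Count (a, b) position pairs that lie at most `window` tokens apart,
    // in either order. Assumes a and b are different words.
    static int countPairs(String[] tokens, String a, String b, int window) {
        int count = 0;
        for (int i = 0; i < tokens.length; i++) {
            if (!tokens[i].equals(a)) {
                continue;
            }
            int from = Math.max(0, i - window);
            int to = Math.min(tokens.length - 1, i + window);
            for (int j = from; j <= to; j++) {
                if (tokens[j].equals(b)) {
                    count++;
                }
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // Hypothetical text with "dog" between two occurrences of "computer".
        String[] tokens = "computer a b dog c d computer".split(" ");
        System.out.println("window 50: "
                + countPairs(tokens, "computer", "dog", 50));
        System.out.println("window 2: "
                + countPairs(tokens, "computer", "dog", 2));
    }
}
```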

In case you have nonstandard retrieval requirements, e.g. you only
need the number of hits and no further information from the matching
documents, you may consider using your own HitCollector on the
lower level search methods.

Regards,
Paul Elschot





Word co-occurrences counts

2004-12-22 Thread Andrew.Cunningham
Hi all,

I have a curious problem, and initial poking around with Lucene looks
like it may only be able to half-handle the problem.

 

The problem requires two abilities:

1.  To be able to return the number of times the word appears in all
the documents (which it looks like lucene can do through IndexReader) 
2.  To be able to return the number of word co-occurrences within
the document set (i.e., how many times does "computer" appear within 50
words of "dog")

 

Is the second point possible?

 

Thanks all, and happy holidays,

Andrew