Re: statistics about word distances in solr

2009-06-09 Thread Michael Ludwig

Hi Jens,

Jens Fischer wrote:

I was wondering if there's an option to return statistics about
distances from the query terms to the most frequent terms in the
result documents.



The additional information I'm looking for is the average distance
between these terms and my search term.

So let's say I have two docs

the house is red
I live in a red house

The search for house should also return the info

the:1
is:1
red:1.5
I:5
live:4


Could you explain what the distance here is? Something like edit
distance? Ah, I see: You want the textual distance between the search
term and other terms in the document, and then you want that averaged,
i.e. the cumulative distance divided by the number of occurrences.
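To make that reading concrete, here is a minimal sketch of the calculation in plain Python, outside of Solr. It assumes whitespace tokenization, case-sensitive matching, and that the distance of a term occurrence is measured to the nearest occurrence of the search term in the same document; the function name `average_distances` is made up for illustration and none of this is Solr functionality.

```python
from collections import defaultdict

def average_distances(docs, query_term):
    """Average positional distance of every other term to query_term."""
    totals = defaultdict(int)  # term -> summed distance over all occurrences
    counts = defaultdict(int)  # term -> number of occurrences counted
    for doc in docs:
        tokens = doc.split()  # assumption: naive whitespace tokenization
        query_positions = [i for i, t in enumerate(tokens) if t == query_term]
        if not query_positions:
            continue  # document does not contain the search term
        for i, token in enumerate(tokens):
            if token == query_term:
                continue
            # assumption: distance to the nearest occurrence of the search term
            totals[token] += min(abs(i - q) for q in query_positions)
            counts[token] += 1
    return {term: totals[term] / counts[term] for term in totals}

docs = ["the house is red", "I live in a red house"]
result = average_distances(docs, "house")
```

For the two example documents this gives 1.5 for "red" (distance 2 in the first document, 1 in the second), which matches the numbers in the question.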

No idea if that functionality is available.

However, the sort of calculation you want to perform requires the engine
to not only collect all the terms to present as facets (much improved in
1.4, as I've just learned), but to also analyze each document (if I'm
not mistaken) to determine the distance for each facet term from your
primary query term. (Or terms.)

The number of lookup operations is likely to scale as the product of
the number of your primary search results, the number of your search
terms, and the number of your facets.
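As a rough back-of-the-envelope illustration of that product (the input numbers below are invented for the example, not measurements, and the function name is hypothetical):

```python
def estimated_lookups(num_results: int, num_query_terms: int, num_facet_terms: int) -> int:
    """Rough upper bound: one lookup per (result, query term, facet term) triple."""
    return num_results * num_query_terms * num_facet_terms

# Hypothetical workload: 10,000 matching documents, 2 query terms, 100 facet terms.
lookups = estimated_lookups(10_000, 2, 100)
```

With these made-up inputs the estimate comes to two million lookups for a single query.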

I assume this is an expensive operation.

Michael Ludwig


statistics about word distances in solr

2009-06-04 Thread Jens Fischer
Hi,

I was wondering if there's an option to return statistics about distances
from the query terms to the most frequent terms in the result documents.

At present I return the most frequent terms using facetSearch, which returns
for each word in the result documents the number of occurrences (within the
results).

The additional information I'm looking for is the average distance between
these terms and my search term.

So let's say I have two docs

the house is red
I live in a red house

The search for house should also return the info

the:1
is:1
red:1.5
I:5
live:4

and so on...

 

 

As I wasn't able to find such a function, I thought about two solutions to
the problem:

1) Use facetSearch and implement a different facet.method which calculates
the average distance of a word to the given search term.

Solr doesn't seem to provide an interface to implement a different method,
so I think this solution would be a bit dodgy and would lead to problems
with the next Solr upgrade.

2) Using the TermVectorComponent, which returns the position of each word
within a document, I could calculate the distance based on this data in the
application.

But TermVectorComponent returns information per document, which means I would
need to return all documents of the result set, which is probably not
recommended.
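A minimal sketch of how the application side of solution 2) could look, assuming the per-document term positions have already been extracted into plain dictionaries. The dictionaries below are a hypothetical, simplified stand-in and not the actual TermVectorComponent response format; the class name `DistanceAccumulator` is also made up. The point is only that documents can be processed one at a time while keeping nothing but running sums, so the full result set never has to be held in memory at once.

```python
from collections import defaultdict

class DistanceAccumulator:
    """Keeps running sums so that each document can be discarded after use."""

    def __init__(self, query_term):
        self.query_term = query_term
        self.totals = defaultdict(int)  # term -> summed distance
        self.counts = defaultdict(int)  # term -> occurrences counted

    def add_document(self, positions):
        # positions: hypothetical mapping of term -> list of token positions
        query_positions = positions.get(self.query_term, [])
        if not query_positions:
            return
        for term, term_positions in positions.items():
            if term == self.query_term:
                continue
            for p in term_positions:
                # assumption: distance to the nearest search-term occurrence
                self.totals[term] += min(abs(p - q) for q in query_positions)
                self.counts[term] += 1

    def averages(self):
        return {t: self.totals[t] / self.counts[t] for t in self.totals}

# Position data for the two example documents, written out by hand.
doc1 = {"the": [0], "house": [1], "is": [2], "red": [3]}
doc2 = {"I": [0], "live": [1], "in": [2], "a": [3], "red": [4], "house": [5]}

acc = DistanceAccumulator("house")
for doc_positions in (doc1, doc2):
    acc.add_document(doc_positions)
averages = acc.averages()
```

This still requires fetching position data for every matching document, so it does not remove the cost concern, it only bounds the memory needed in the application.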

 

 

My questions are:

a) Did I miss a function of Solr that already does what I'm looking for?

b) Is solution 2) feasible even if I always have to return all docs of the
result set (the content doesn't need to be returned though, just the
statistics)?

c) Are there interfaces for amending facetSearch in the way I described that I
might have missed?

Thanks

Jens