Re: facet results in order of rank

2009-04-27 Thread Gene Campbell
Thanks for the reply

Your thoughts match my initial thinking.  But, given some more
consideration, I imagined a system that would take all the docs that
would be returned for a given facet, and compute an average score based on
their scores from the original search that produced the facets.  This
would be the facet value's rank.  So, a higher-ranked facet value would
be more likely to return higher-ranked results.

The idea is this: if you run a broad, loose search over a large
dataset and order the results by rank, you get the most
relevant results at the top, e.g. the first page on a search engine
website.  You might have pages and pages of results, but it's the
first few pages of highly ranked results that most users
actually see.  As the relevance tapers off, they generally do another
search.

However, if you compute facet values on these results, you have no way
of knowing whether one facet value for a field is more or less likely to
return higher-scored, relevant records for the user.  You end up
getting facet values that match records that are often totally
irrelevant.

We can sort by index order, or by count of docs returned.  What I would
like is a sort based on score, such that it would be
sum(scores)/count.
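To make that concrete, here is a rough sketch in Python of the sum(scores)/count ranking I have in mind.  The data structures are hypothetical; stock Solr does not expose per-facet document scores like this, so treat it as pseudocode for the idea, not an existing API:

```python
# Sketch: rank each facet value by the average relevance score of the
# docs carrying it.  Input is hypothetical (score, facet_value) pairs
# taken from the original search that produced the facets.
from collections import defaultdict

def rank_facet_values(docs):
    """docs: list of (score, facet_value) pairs from the original search."""
    totals = defaultdict(lambda: [0.0, 0])  # facet_value -> [sum, count]
    for score, value in docs:
        totals[value][0] += score
        totals[value][1] += 1
    # Rank = sum(scores) / count, highest average first.
    ranked = {v: s / c for v, (s, c) in totals.items()}
    return sorted(ranked.items(), key=lambda kv: kv[1], reverse=True)

docs = [(0.9, "wolf"), (0.8, "wolf"), (0.2, "bear"), (0.4, "bear")]
print(rank_facet_values(docs))  # 'wolf' ranks above 'bear'
```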

I would assume that most users would be interested in the higher-ranked
ones more often.  So, a more efficient UI could be built to
show just the facets that rank highly on this score, and provide a control
to show all the facets (not just the high-ranked ones.)

Does this clear up my post at all?

Perhaps this wouldn't be too hard for me to implement.  I have lots of
Java experience, but no experience with Lucene or Solr code.
thoughts?

thanks
gene




On Tue, Apr 28, 2009 at 10:56 AM, Shalin Shekhar Mangar
shalinman...@gmail.com wrote:
 On Fri, Apr 24, 2009 at 12:25 PM, ristretto.rb ristretto...@gmail.com wrote:

 Hello,

 Is it possible to order the facet results on some ranking score?
 I've had a look at the facet.sort param,
 (
 http://wiki.apache.org/solr/SimpleFacetParameters#head-569f93fb24ec41b061e37c702203c99d8853d5f1
 )
 but that seems to order the facet either by count or by index value
 (in my case alphabetical.)


 Facets are not ranked because there is no criterion for determining relevancy
 for them. They are just the count of documents for each term in a given
 field, computed for the current result set.



 We are facing a large number of facet results for multi-term
 queries that are OR'ed together.  We want to keep the OR nature of our
 queries,
 but, we want to know which facet values are likely to give you higher
 ranked results.  We could AND together the terms, to get the facet
 list to be
 more manageable, but we would be filtering out too many results.  We
 prefer to OR terms and let the ranking bring the good stuff to the
 top.

 For example, suppose we have a index of all known animals and
 each doc has a field AO for animal-origin.

 Suppose we search for:  wolf grey forest Europe
 And generate facets AO.  We might get the following
 facet results:

 For the AO field, lots of countries of the world probably have grey or
 forest or wolf or Europe in their indexing data, so I'm asserting we'd
 get a big list here.
 But, only some of the countries will have all 4 terms, and those are
 the facets that will be the most interesting to drill down on.  Is
 there
 a way to figure out which facet is the most highly ranked like this?


 Suppose 10 documents match the query you described. If you facet on AO, then
 it would just go through all the terms in AO and give you the number of
 documents which have that term. There's no question of relevance at all
 here. The returned documents themselves are of course ranked according to
 the relevancy score.

 Perhaps I've misunderstood the query?

 --
 Regards,
 Shalin Shekhar Mangar.



Re: how to find terms on a page?

2008-09-22 Thread Gene Campbell
That's excellent.  Thanks for the reply.

gene


On Tue, Sep 23, 2008 at 6:39 AM, Chris Hostetter
[EMAIL PROTECTED] wrote:

 : I haven't heard of or found a way to find the number of times a term
 : is found on a page.
 : Lucene uses it in scoring, I believe, (solr scoring:  
 http://tinyurl.com/4tb55r)

 Assuming by page you mean document then the term frequency (tf) is
 factored into the score, but at a low enough level that it's not carried
 along with the score during a normal search.

 : Basically, for a given page, I would like
 : a list of terms on the page and number of times the terms appear on the 
 page?

 work is currently being done however to make it possible for people to
 fetch some of the raw tf/idf info directly...

 https://issues.apache.org/jira/browse/SOLR-651


 -Hoss




Re: How to set term frequency given a term and a value stating the frequency?

2008-09-17 Thread Gene Campbell
I decided to store the word X number of times when indexing the doc.

times = 5
value = "dog " * times   # "dog dog dog dog dog " gets indexed; of
course, times is specific to each doc.
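For anyone following along, the workaround amounts to this (field names are made up for illustration):

```python
def repeated_term_field(term, times):
    """Fake a term-frequency boost by repeating the term in the field
    value; Lucene's tf component then scores docs with more repetitions
    above docs with fewer."""
    return " ".join([term] * times)

# Hypothetical docs: A should outrank B on a search for "Banana".
doc_a = {"id": "A", "body": repeated_term_field("Banana", 100)}
doc_b = {"id": "B", "body": repeated_term_field("Banana", 20)}
```

It is a blunt instrument (it inflates index size and field length norms), but it works without patching Solr.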

thanks for the help and advice Otis!!

cheers
gene


On Thu, Sep 18, 2008 at 4:27 AM, Otis Gospodnetic
[EMAIL PROTECTED] wrote:
 There are Lucene field term Payloads that can be associated with each token,
 which I think you could use for this type of boosting, but there is not much
 built-in support for Payloads in Solr yet.


 Otis
 --
 Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



 - Original Message 
 From: ristretto.rb [EMAIL PROTECTED]
 To: solr-user@lucene.apache.org
 Sent: Wednesday, September 17, 2008 5:24:20 AM
 Subject: How to set term frequency given a term and a value stating the 
 frequency?

 Hello,

 I'm looking through the wiki, so if it's there, I'll find it, and you
 can ignore this post.
 If this isn't documented, can anyone explain how to achieve this?

 Suppose I have two docs A and B that I want to index.  I want to index
 these documents
 so that A has the equivalent of 100 copies of 'Banana', and B has the
 equivalent of 20 copies of
 'Banana', so that searches for Banana will rank A before B, due to
 term frequency.

 When indexing, I would have something like

 A Banana 100
 B Banana 20.

 Will I have to repeat 'Banana' 100 times in a string variable that I
 send to the index?   And likewise 20 times for B?
 Or is there a better way to accomplish this?

 thanks
 gene




Re: Filtering results

2008-09-16 Thread Gene Campbell
Thanks for the reply Erik

Sorry for being vague.  To be clear, we have 1-2 million records, and
roughly 12,000-14,000 groups.
Each record is in one and only one group.

I see it working something like this

1.  Identify all records that would match search terms.  (Suppose I
search for 'dog', and get 450,000 matches)
2.  Of those records, find the distinct list of groups over all the
matches.  (Suppose there are 300.)
3.  Now get the top ranked record from each group, as if you search
just for docs in the group.
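Done client-side over an already-ranked result list, those three steps reduce to keeping the best-scored doc per group.  A minimal sketch (the field names are made up, and this ignores the pagination problem mentioned below):

```python
def top_doc_per_group(results, group_field="group", score_field="score"):
    """results: docs from the search.  Sort by score descending, then
    keep the first (highest-scored) doc seen for each group value."""
    best = {}
    for doc in sorted(results, key=lambda d: d[score_field], reverse=True):
        best.setdefault(doc[group_field], doc)
    return list(best.values())
```

The catch, of course, is that it only sees the rows actually fetched, which is why a server-side solution like the field collapsing patch is attractive.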

Your response has me thinking this is a hard nut to crack.  I'm
wondering if there is a way to structure ranking to get us close on
this one?

thanks
gene




On Wed, Sep 17, 2008 at 8:39 AM, Erik Hatcher
[EMAIL PROTECTED] wrote:
 Personally, I'd send three requests to Solr, one for each group.
  rows=1&fq=category:A ... and so on.

 But that'd depend on how many groups you have.

 One can always hack custom request handlers to do this sort of thing all as
 a single request, but I'd guess it ain't that much slower to just make 3
 requests.  And there are fancier solutions out there that might fit as well,
 like the field collapsing patch.

Erik

 On Sep 16, 2008, at 4:13 PM, ristretto.rb wrote:

 Hello All,

 I'm looking for a way to filter results by some ranking mechanism.
 For example...

 Suppose you have 30 docs in an index, and they are in groups of 10, like
 this

 A, 1
 A, 2
 :
 A, 10

 B, 1
 B, 2
 :
 B, 10

 C, 1
 C, 2
 :
 C, 10

 I would like to get 3 records back such that I get a single,  best,
 result from each logical group.
 So, if I searched with a term that would match all the docs in the
 index, I could be certain to get
 a doc with A in it, one with B in it and one with C in it.

 At the moment, I have a Solr index that has a category field, and the
 index will have between 1 and 2 million results
 when we are done indexing.

 I'm going to spend some time today researching this.  If anyone can
 send me some advice, I would be grateful.

 I've considered post-processing the results, but I'm not sure if this
 is the wisest plan.  And, I don't know how I would get accurate
 result counts to do pagination.

 cheers




Re: How to copy a solr index to another index with a different schema collapsing stored data?

2008-09-16 Thread Gene Campbell
I was pretty sure you'd say that.  But it means a lot that you took the
time to confirm it.  Thanks Otis.

I don't want to give details, but we crawl for our data, and we don't
save it in a DB or on disk.  It goes straight from download to index.  It was
a good idea at the time, when we thought our designs were done evolving.  :)

cheers
gene


On Wed, Sep 17, 2008 at 12:51 PM, Otis Gospodnetic
[EMAIL PROTECTED] wrote:
 You can't copy+merge+flatten indices like that.  Reindexing would be the 
 easiest.  Indexing taking weeks sounds suspicious.  How much data are you 
 reindexing and how big are your indices?

 Otis
 --
 Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



 - Original Message 
 From: ristretto.rb [EMAIL PROTECTED]
 To: solr-user@lucene.apache.org
 Sent: Tuesday, September 16, 2008 8:14:16 PM
 Subject: How to copy a solr index to another index with a different schema 
 collapsing stored data?

 Is it possible to copy stored index data from one index to another,
 concatenating it as you go?

 Suppose 2 categories A and B both with 20 docs, for a total of 40 docs
 in the index.  The index has a stored field for the content from the
 docs.

 I want a new index with only two docs in it, one for A and one for B.
 And it would have a stored field that is the sum of all the stored
 data for the 20 docs of A and of B respectively.

 So, then a query on this index will give me a relevant list of
 Categories?

 Perhaps there's a solr query to get that data out, and then I can
 handle concatenating it, and then indexing it in the new index.

 I'm hoping I don't have to reindex all this data from scratch?  It has
 taken weeks!

 thanks
 gene




Re: Filtering results

2008-09-16 Thread Gene Campbell
OK, thanks Otis.  Any gut feeling on the best approach to get this
collapsed data?  I hate to ask you to do my homework, but I'm coming
to the end of my Solr/Lucene knowledge.  I don't code Java too well - used
to, but switched to Python a while back.

gene




On Wed, Sep 17, 2008 at 12:47 PM, Otis Gospodnetic
[EMAIL PROTECTED] wrote:
 Gene,

 The latest patch from Bojan for SOLR-236 works with whatever revision of Solr 
 he used when he made the patch.

 I didn't follow this thread to know your original requirements, but running
 1+10 queries doesn't sound good to me from a scalability/performance point
 of view.

 Otis
 --
 Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



 - Original Message 
 From: ristretto.rb [EMAIL PROTECTED]
 To: solr-user@lucene.apache.org
 Sent: Tuesday, September 16, 2008 6:45:02 PM
 Subject: Re: Filtering results

 thanks.  very interesting.  The plot thickens.  And, yes, I think
 field collapsing is exactly what I'm after.

 I am now considering trying this patch.  I have a Solr 1.2 instance
 on Jetty.  It looks like I need to install the patch.
 Does anyone use that patch?  Recommend it?  The wiki page
 (http://wiki.apache.org/solr/FieldCollapsing) says
 This patch is not complete, but it will be useful to keep this page
 updated while the interface evolves.  And the page
 was last updated over a year ago, so I'm not sure if that is a good sign.
 I'm trying to read through all the comments now.

 I'm also considering creating a second index of just the
 categories, which contains all the content from the main index
 collapsed down into the corresponding categories - basically a completely
 collapsed index.
 Initial searches will be done against this collapsed category index,
 and then the first 10 results
 will be used to do 10 field queries against the main index to get the
 top records to return with each Category.

 Haven't decided which path to take yet.

 cheers
 gene


 On Wed, Sep 17, 2008 at 9:42 AM, Chris Hostetter
 wrote:
 
  : 1.  Identify all records that would match search terms.  (Suppose I
  : search for 'dog', and get 450,000 matches)
  : 2.  Of those records, find the distinct list of groups over all the
  : matches.  (Suppose there are 300.)
  : 3.  Now get the top ranked record from each group, as if you search
  : just for docs in the group.
 
  this sounds similar to Field Collapsing although i don't really
  understand it or your specific use case enough to be certain that it's the
  same thing.  You may find the patch, and/or the discussions about the
  patch useful starting points...
 
  https://issues.apache.org/jira/browse/SOLR-236
  http://wiki.apache.org/solr/FieldCollapsing
 
 
  -Hoss
 
 




Re: Clarification on facets

2008-08-18 Thread Gene Campbell
Thank you for the response.  It's always nice to have someone willing to
validate your thinking!

Of course, if anyone has any ideas on how to get the number of times a
term is repeated in a document,
I'm all ears.

cheers
gene


On Tue, Aug 19, 2008 at 1:42 PM, Norberto Meijome [EMAIL PROTECTED] wrote:
 On Tue, 19 Aug 2008 10:18:12 +1200
 Gene Campbell [EMAIL PROTECTED] wrote:

 Is this interpreted as meaning, there are 10 documents that will match
 with 'car' in the title, and likewise 6 'boat' and 2 'bike'?

 Correct.

 If so, is there any way to get counts for the *number of times* a value
 is found in a document?  I'm looking for a way to determine the number
 of times 'car' is repeated in the title, for example

 Not sure - i would suggest that a field with a term repeated several times 
 would receive a higher score when searching for that term, but not sure how 
 you could get the information you seek...maybe with the Luke handler ? ( but 
 on a per-document basis...slow... ? )

 B
 _
 {Beto|Norberto|Numard} Meijome

 Computers are like air conditioners; they can't do their job properly if you 
 open windows.

 I speak for myself, not my employer. Contents may be hot. Slippery when wet. 
 Reading disclaimers makes you go blind. Writing them is worse. You have been 
 Warned.



Help with word frequency / tag clouds

2008-08-17 Thread Gene Campbell
Hello Solrites,

I'm somewhat new to Solr and Lucene.  I would like to build a tag
cloud based on a filtered set of words from documents.  I have a
master list of approved tags.  So, what I need from each document is
the list of words and frequencies, filtered so that the words appear in
the master list.  Then, I should be able to build a tag cloud
UI (in html/css)

Is this something I have to build?  If so, I'm guessing I would need
to do it during indexing, but how?  Perhaps I need an Analyzer or
Tokenizer that can give me counts of words, and then let me filter and
store them in a DB, or back in the index.
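For concreteness, the filtering/weighting half of what I'm after would be something like this sketch in Python (all names are made up; the per-term counts would come from wherever the indexer exposes them):

```python
def cloud_weights(term_counts, approved_tags, buckets=5):
    """Keep only approved tags and bucket their counts into 1..buckets
    CSS size classes for a tag cloud."""
    filtered = {t: c for t, c in term_counts.items() if t in approved_tags}
    if not filtered:
        return {}
    lo, hi = min(filtered.values()), max(filtered.values())
    span = max(hi - lo, 1)  # avoid division by zero when all counts match
    return {t: 1 + (c - lo) * (buckets - 1) // span
            for t, c in filtered.items()}

counts = {"solr": 10, "java": 2, "noise": 50}
print(cloud_weights(counts, {"solr", "java"}))  # {'solr': 5, 'java': 1}
```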

Can anyone offer some advice?

thanks
gene


Re: Help with word frequency / tag clouds

2008-08-17 Thread Gene Campbell
OK, thanks.   I will have another look at faceting.

gene



On Mon, Aug 18, 2008 at 3:48 AM, Shalin Shekhar Mangar
[EMAIL PROTECTED] wrote:
 Hi Gene,

 Solr supports this (faceted search) out-of-the-box.

 Take a look at:
 http://wiki.apache.org/solr/SimpleFacetParameters
 http://wiki.apache.org/solr/SolrFacetingOverview

 It will give you words with their frequencies for the fields you select.
 However, it will give you all the facets (tags) and your front-end must do
 the filtering with the master list.

 On Sun, Aug 17, 2008 at 11:43 AM, Gene Campbell [EMAIL PROTECTED] wrote:

 Hello Solrites,

 I'm somewhat new to Solr and Lucene.  I would like to build a tag
 cloud based on a filtered set of words from documents.  I have a
 master list of approved tags.  So, what I need from each document is
 the list of words and frequencies such that the words appear in the
 master list (filtered.)  Then, I should be able to build a tag cloud
 UI (in html/css)

 Is this something I have to build?  If so, I'm guessing I would need
 to do it during indexing, but how?  Perhaps I need an Analyzer or
 Tokenizer that can give me counts of words, and then let me filter and
 store in a DB, or back in the index.

 Can anyone shed some advice?

 thanks
 gene




 --
 Regards,
 Shalin Shekhar Mangar.



Can facet numbers be constrained to one result doc or a group of result docs?

2008-08-17 Thread Gene Campbell
I'm still learning how to use facets with Solr correctly.  It seems
that you get facet counts computed over all docs in your index.
For example, I tried this on a local index I've built up for testing.
This index has urls uniquely indexed, so no two docs
have the same url value.

http://localhost:8085/solr/select?q=www.example.com&facet=true&facet.field=title&facet.limit=-1&facet.mincount=1

This returns what seem like facet values for the title field over my
whole index.  What I would like is the facet value counts
computed just over the docs returned.  Is this possible?

thanks
gene


Re: Can facet numbers be constrained to one result doc or a group of result docs?

2008-08-17 Thread Gene Campbell
OK, more testing seems to show that if I set mincount=1, I only get
facet field values that are actually in the documents

'facet_fields':{
'title':[
 'build',1,
 'central',1,
:
:

for

http://localhost:8085/solr/select?q=example.com&wt=python&indent=on&facet=true&facet.field=title&facet.limit=-1&facet.sort=true&facet.mincount=1

assuming title is a facetable field.
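As an aside, the flat 'term,count,term,count' list in the wt=python response pairs up easily; a small sketch (the response dict below is abbreviated from the output above):

```python
def facet_counts(response, field):
    """Solr's facet_fields come back as a flat
    [term, count, term, count, ...] list; pair it into a dict."""
    flat = response["facet_counts"]["facet_fields"][field]
    return dict(zip(flat[0::2], flat[1::2]))

resp = {"facet_counts": {"facet_fields": {"title": ["build", 1, "central", 1]}}}
print(facet_counts(resp, "title"))  # {'build': 1, 'central': 1}
```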

please correct if I'm on the wrong track.

cheers
gene




On Mon, Aug 18, 2008 at 5:10 PM, Gene Campbell [EMAIL PROTECTED] wrote:
 I'm still learning how to use facets with Solr correctly.  It seems
 that you get facet counts computed over all docs in your index.
 For example, I tried this on a local index I've built up for testing.
 This index has urls uniquely indexed, so no two docs
 have the same url value.

 http://localhost:8085/solr/select?q=www.example.com&facet=true&facet.field=title&facet.limit=-1&facet.mincount=1

 This returns what seem like facet values for the title field over my
 whole index.  What I would like is the facet value counts
 computed just over the docs returned.  Is this possible?

 thanks
 gene