Re: facet results in order of rank
Thanks for the reply. Your thoughts are what I was initially thinking. But, given some more consideration, I imagined a system that would take all the docs that would be returned for a given facet, and compute an average score based on their scores from the original search that produced the facets. This would be the facet value's rank. So, a higher ranked facet value would be more likely to return higher ranked results.

The idea is this: suppose you want a broad, loose search over a large dataset, and you order the results by rank so you get the most relevant results at the top, e.g. the first page in a search engine website. You might have pages and pages of results, but it's the first few pages of highly ranked results that most users generally see. As the relevance tapers off, they generally do another search. However, if you compute facet values on these results, you have no way of knowing if one facet value for a field is more or less likely to return higher scored, relevant records for the user. You end up getting facet values that match records that are often totally irrelevant.

We can sort by index order, or by count of docs returned. What I would like is a sort based on score, such that it would be sum(scores)/count. I would assume that most users would be interested in the higher ranked ones more often. So, a more efficient UI could be built to show just the facets ranked high on this score, and provide a control to show all the facets (not just the high ranked ones).

Does this clear up my post at all? Perhaps this wouldn't be too hard for me to implement. I have lots of Java experience, but no experience with the Lucene or Solr code. Thoughts?

thanks
gene

On Tue, Apr 28, 2009 at 10:56 AM, Shalin Shekhar Mangar shalinman...@gmail.com wrote:

On Fri, Apr 24, 2009 at 12:25 PM, ristretto.rb ristretto...@gmail.com wrote:

Hello, Is it possible to order the facet results on some ranking score?
I've had a look at the facet.sort param ( http://wiki.apache.org/solr/SimpleFacetParameters#head-569f93fb24ec41b061e37c702203c99d8853d5f1 ), but that seems to order the facets either by count or by index value (in my case alphabetical).

Facets are not ranked because there are no criteria for determining relevancy for them. They are just the count of documents for each term in a given field, computed for the current result set.

We are facing a big number of facet results for multiple-term queries that are OR'ed together. We want to keep the OR nature of our queries, but we want to know which facet values are likely to give you higher ranked results. We could AND together the terms to get the facet list to be more manageable, but we would be filtering out too many results. We prefer to OR terms and let the ranking bring the good stuff to the top.

For example, suppose we have an index of all known animals and each doc has a field AO for animal-origin. Suppose we search for: wolf grey forest Europe and generate facets on AO. We might get the following facet results: for the AO field, lots of countries of the world probably have grey or forest or wolf or Europe in their indexing data, so I'm asserting we'd get a big list here. But only some of the countries will have all 4 terms, and those are the facets that will be the most interesting to drill down on. Is there a way to figure out which facet is the most highly ranked like this?

Suppose 10 documents match the query you described. If you facet on AO, then it would just go through all the terms in AO and give you the number of documents which have that term. There's no question of relevance at all here. The returned documents themselves are of course ranked according to the relevancy score. Perhaps I've misunderstood the query?

--
Regards,
Shalin Shekhar Mangar.
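A client-side sketch of the score-based facet sort gene describes above (sum(scores)/count per facet value) might look like the following. This is an illustration only, not Solr's API: it assumes you have already fetched the hits with their `score` field and the facet field in hand.

```python
from collections import defaultdict

def rank_facets_by_score(docs, facet_field):
    """Rank facet values by the mean score of their matching docs,
    i.e. sum(scores)/count, highest average first."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for doc in docs:
        value = doc.get(facet_field)
        if value is None:
            continue
        totals[value] += doc["score"]
        counts[value] += 1
    ranked = [(value, totals[value] / counts[value]) for value in totals]
    ranked.sort(key=lambda pair: pair[1], reverse=True)
    return ranked

# hypothetical hits from the wolf/grey/forest/Europe example
hits = [
    {"AO": "France", "score": 4.0},
    {"AO": "France", "score": 2.0},
    {"AO": "Chile", "score": 1.0},
]
print(rank_facets_by_score(hits, "AO"))  # [('France', 3.0), ('Chile', 1.0)]
```

Doing this server-side inside Solr would be more efficient (no need to fetch every hit), which is presumably what the proposed `facet.sort` extension would do.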
Re: how to find terms on a page?
That's excellent. Thanks for the reply.

gene

On Tue, Sep 23, 2008 at 6:39 AM, Chris Hostetter [EMAIL PROTECTED] wrote:

: I haven't heard of or found a way to find the number of times a term
: is found on a page.
: Lucene uses it in scoring, I believe, (solr scoring: http://tinyurl.com/4tb55r)

Assuming by page you mean document, then the term frequency (tf) is factored into the score, but at a low enough level that it's not carried along with the score during a normal search.

: Basically, for a given page, I would like
: a list of terms on the page and number of times the terms appear on the page?

work is currently being done however to make it possible for people to fetch some of the raw tf/idf info directly...

https://issues.apache.org/jira/browse/SOLR-651

-Hoss
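Until something like SOLR-651 makes the raw tf info fetchable, one hedged workaround is to count term frequencies client-side from the stored field text. A minimal sketch, assuming naive lowercase/whitespace tokenization (which will not match a real Solr analyzer chain exactly):

```python
from collections import Counter

def term_frequencies(stored_text):
    """Naive client-side tf: lowercase and whitespace-split a stored field."""
    return Counter(stored_text.lower().split())

tf = term_frequencies("Dog chases dog down the road")
print(tf["dog"])  # 2
```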
Re: How to set term frequency given a term and a value stating the frequency?
I decided to store the word X number of times when indexing the doc: times = 5; value = 'dog ' * times  # 'dog dog dog dog dog' gets indexed. Of course, times is specific to each doc.

thanks for the help and advice Otis!!

cheers
gene

On Thu, Sep 18, 2008 at 4:27 AM, Otis Gospodnetic [EMAIL PROTECTED] wrote:

There are Lucene field term Payloads that can be associated with each token, which I think you could use for this type of boosting, but there is not much built-in support for Payloads in Solr yet.

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

- Original Message -
From: ristretto.rb [EMAIL PROTECTED]
To: solr-user@lucene.apache.org
Sent: Wednesday, September 17, 2008 5:24:20 AM
Subject: How to set term frequency given a term and a value stating the frequency?

Hello, I'm looking through the wiki, so if it's there, I'll find it, and you can ignore this post. If this isn't documented, can anyone explain how to achieve this?

Suppose I have two docs A and B that I want to index. I want to index these documents so that A has the equivalent of 100 copies of 'Banana', and B has the equivalent of 20 copies of 'Banana', so that searches for Banana will rank A before B, due to term frequency. When indexing, I would have something like A Banana 100 and B Banana 20. Will I have to repeat 'Banana' 100 times in a string variable that I send to the index? And likewise 20 times for B? Or is there a better way to accomplish this?

thanks
gene
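The repeated-term field value described above can be built in Python (which gene mentions using elsewhere in the thread); `times` here is just an example value, since it is specific to each doc:

```python
# times is specific to each doc; 5 is only an example
times = 5
value = " ".join(["dog"] * times)  # joined with spaces so terms tokenize separately
print(value)  # dog dog dog dog dog
```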
Re: Filtering results
Thanks for the reply, Erik. Sorry for being vague. To be clear, we have 1-2 million records, and roughly 12000-14000 groups. Each record is in one and only one group. I see it working something like this:

1. Identify all records that would match search terms. (Suppose I search for 'dog', and get 450,000 matches)
2. Of those records, find the distinct list of groups over all the matches. (Suppose there are 300.)
3. Now get the top ranked record from each group, as if you search just for docs in the group.

Your response has me thinking this is a hard nut to crack. I'm wondering if there is a way to structure ranking to get us close on this one?

thanks
gene

On Wed, Sep 17, 2008 at 8:39 AM, Erik Hatcher [EMAIL PROTECTED] wrote:

Personally, I'd send three requests to Solr, one for each group: rows=1&fq=category:A ... and so on. But that'd depend on how many groups you have. One can always hack custom request handlers to do this sort of thing all as a single request, but I'd guess it ain't that much slower to just make 3 requests. And there are fancier solutions out there that might fit as well, like the field collapsing patch.

Erik

On Sep 16, 2008, at 4:13 PM, ristretto.rb wrote:

Hello All, I'm looking for a way to filter results by some ranking mechanism. For example... Suppose you have 30 docs in an index, and they are in groups of 10, like this:

A, 1
A, 2
:
A, 10
B, 1
B, 2
:
B, 10
C, 1
C, 2
:
C, 10

I would like to get 3 records back, such that I get a single, best, result from each logical group. So, if I searched with a term that would match all the docs in the index, I could be certain to get a doc with A in it, one with B in it, and one with C in it. At the moment, I have a solr index that has a category field, and the index will have between 1 and 2 million results when we are done indexing. I'm going to spend some time today researching this. If anyone can send me some advice, I would be grateful.
I've considered post-processing the results, but I'm not sure if this is the wisest plan. And I don't know how I would get accurate result counts to do pagination.

cheers
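A sketch of what post-processing step 3 (top-ranked doc per group) could look like, assuming the docs arrive already sorted by score, best first, each with a `category` field; the field and doc shapes here are illustrative:

```python
def top_per_group(docs, group_field="category"):
    """Keep the first (i.e. highest-scoring) doc seen for each group.

    Assumes docs are already sorted by score, best first.
    """
    seen = set()
    best = []
    for doc in docs:
        group = doc[group_field]
        if group not in seen:
            seen.add(group)
            best.append(doc)
    return best

results = [
    {"category": "A", "id": 1, "score": 9.0},
    {"category": "B", "id": 7, "score": 8.5},
    {"category": "A", "id": 2, "score": 7.0},
]
print([d["id"] for d in top_per_group(results)])  # [1, 7]
```

The pagination worry stands, though: with 450,000 matches you would have to pull every hit to the client to collapse them this way, which is exactly why a server-side solution (field collapsing) is attractive.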
Re: How to copy a solr index to another index with a different schema collapsing stored data?
I was pretty sure you'd say that. But it means a lot that you took the time to confirm it. Thanks Otis. I don't want to give details, but we crawl for our data, and we don't save it in a DB or on disk. It goes from download to index. Was a good idea at the time; when we thought our designs were done evolving. :)

cheers
gene

On Wed, Sep 17, 2008 at 12:51 PM, Otis Gospodnetic [EMAIL PROTECTED] wrote:

You can't copy+merge+flatten indices like that. Reindexing would be the easiest. Indexing taking weeks sounds suspicious. How much data are you reindexing and how big are your indices?

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

- Original Message -
From: ristretto.rb [EMAIL PROTECTED]
To: solr-user@lucene.apache.org
Sent: Tuesday, September 16, 2008 8:14:16 PM
Subject: How to copy a solr index to another index with a different schema collapsing stored data?

Is it possible to copy stored index data from one index to another, but concatenating it as you go? Suppose 2 categories A and B, both with 20 docs, for a total of 40 docs in the index. The index has a stored field for the content from the docs. I want a new index with only two docs in it, one for A and one for B. And it would have a stored field that is the sum of all the stored data for the 20 docs of A and of B respectively. So then a query on this index will give me a relevant list of categories?

Perhaps there's a solr query to get that data out, and then I can handle concatenating it, and then indexing it in the new index. I'm hoping I don't have to reindex all this data from scratch? It has taken weeks!

thanks
gene
Re: Filtering results
OK thanks Otis. Any gut feeling on the best approach to get this collapsed data? I hate to ask you to do my homework, but I'm coming to the end of my Solr/Lucene knowledge. I don't code Java too well - used to, but switched to Python a while back.

gene

On Wed, Sep 17, 2008 at 12:47 PM, Otis Gospodnetic [EMAIL PROTECTED] wrote:

Gene,

The latest patch from Bojan for SOLR-236 works with whatever revision of Solr he used when he made the patch. I didn't follow this thread to know your original requirements, but running 1+10 queries doesn't sound good to me from a scalability/performance point of view.

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

- Original Message -
From: ristretto.rb [EMAIL PROTECTED]
To: solr-user@lucene.apache.org
Sent: Tuesday, September 16, 2008 6:45:02 PM
Subject: Re: Filtering results

thanks. very interesting. The plot thickens. And, yes, I think field collapsing is exactly what I'm after. I am considering now trying this patch. I have a solr 1.2 instance on Jetty. It looks like I need to install the patch. Does anyone use that patch? Recommend it? The wiki page (http://wiki.apache.org/solr/FieldCollapsing) says "This patch is not complete, but it will be useful to keep this page updated while the interface evolves." And the page was last updated over a year ago, so I'm not sure if that is a good sign. I'm trying to read through all the comments now.

I'm also considering creating a second index of just the categories, which contains all the content from the main index collapsed down into the corresponding categories - basically a completely collapsed index. Initial searches will be done against this collapsed category index, and then the first 10 results will be used to do 10 field queries against the main index to get the top records to return with each category. Haven't decided which path to take yet.

cheers
gene

On Wed, Sep 17, 2008 at 9:42 AM, Chris Hostetter wrote:

: 1. Identify all records that would match search terms.
: (Suppose I search for 'dog', and get 450,000 matches)
: 2. Of those records, find the distinct list of groups over all the matches. (Suppose there are 300.)
: 3. Now get the top ranked record from each group, as if you search just for docs in the group.

this sounds similar to Field Collapsing, although i don't really understand it, or your specific use case, enough to be certain that it's the same thing. You may find the patch, and/or the discussions about the patch, useful starting points...

https://issues.apache.org/jira/browse/SOLR-236
http://wiki.apache.org/solr/FieldCollapsing

-Hoss
Re: Clarification on facets
Thank you for the response. Always nice to have someone willing to validate your thinking! Of course, if anyone has any ideas on how to get the number of times a term is repeated in a document, I'm all ears.

cheers
gene

On Tue, Aug 19, 2008 at 1:42 PM, Norberto Meijome [EMAIL PROTECTED] wrote:

On Tue, 19 Aug 2008 10:18:12 +1200 Gene Campbell [EMAIL PROTECTED] wrote:

Is this interpreted as meaning, there are 10 documents that will match with 'car' in the title, and likewise 6 'boat' and 2 'bike'?

Correct.

If so, is there any way to get counts for the *number of times* a value is found in a document. I'm looking for a way to determine the number of times 'car' is repeated in the title, for example.

Not sure - I would suggest that a field with a term repeated several times would receive a higher score when searching for that term, but not sure how you could get the information you seek... maybe with the Luke handler? (but on a per-document basis... slow...?)

B
_
{Beto|Norberto|Numard} Meijome

Computers are like air conditioners; they can't do their job properly if you open windows.

I speak for myself, not my employer. Contents may be hot. Slippery when wet. Reading disclaimers makes you go blind. Writing them is worse. You have been Warned.
Help with word frequency / tag clouds
Hello Solrites, I'm somewhat new to Solr and Lucene. I would like to build a tag cloud based on a filtered set of words from documents. I have a master list of approved tags. So, what I need from each document is the list of words and frequencies, such that the words appear in the master list (filtered). Then I should be able to build a tag cloud UI (in html/css).

Is this something I have to build? If so, I'm guessing I would need to do it during indexing, but how? Perhaps I need an Analyzer or Tokenizer that can give me counts of words, and then let me filter and store them in a DB, or back in the index. Can anyone shed some advice?

thanks
gene
Re: Help with word frequency / tag clouds
OK Thanks. I will have another look at faceting again.

gene

On Mon, Aug 18, 2008 at 3:48 AM, Shalin Shekhar Mangar [EMAIL PROTECTED] wrote:

Hi Gene,

Solr supports this (faceted search) out-of-the-box. Take a look at:

http://wiki.apache.org/solr/SimpleFacetParameters
http://wiki.apache.org/solr/SolrFacetingOverview

It will give you words with their frequencies for the fields you select. However, it will give you all the facets (tags), and your front-end must do the filtering with the master list.

On Sun, Aug 17, 2008 at 11:43 AM, Gene Campbell [EMAIL PROTECTED] wrote:

Hello Solrites, I'm somewhat new to Solr and Lucene. I would like to build a tag cloud based on a filtered set of words from documents. I have a master list of approved tags. So, what I need from each document is the list of words and frequencies, such that the words appear in the master list (filtered). Then I should be able to build a tag cloud UI (in html/css). Is this something I have to build? If so, I'm guessing I would need to do it during indexing, but how? Perhaps I need an Analyzer or Tokenizer that can give me counts of words, and then let me filter and store them in a DB, or back in the index. Can anyone shed some advice?

thanks
gene

--
Regards,
Shalin Shekhar Mangar.
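The front-end filtering against the master list that Shalin mentions could be a one-liner once the facet counts come back; a small sketch with illustrative names (not a Solr API):

```python
def cloud_tags(facet_counts, approved_tags):
    """Keep only tags on the approved master list, for sizing cloud entries."""
    approved = set(approved_tags)
    return {tag: count for tag, count in facet_counts.items() if tag in approved}

# facet counts as returned by faceting, filtered by the master tag list
counts = {"solr": 12, "lucene": 8, "the": 300}
print(cloud_tags(counts, ["solr", "lucene", "search"]))  # {'solr': 12, 'lucene': 8}
```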
Can facet numbers be constrained to one result doc or a group of result docs?
I'm still learning how to use facets with Solr correctly. It seems that you get facet counts computed over all docs in your index. For example, I tried this on a local index I've built up for testing. This index has urls uniquely indexed, so no two docs have the same url value.

http://localhost:8085/solr/select?q=www.example.com&facet=true&facet.field=title&facet.limit=-1&facet.mincount=1

This returns what seems like facet values for the title field over my whole index. What I would like is the facet value counts computed just over the docs returned. Is this possible?

thanks
gene
Re: Can facet numbers be constrained to one result doc or a group of result docs?
OK, more testing seems to say that if I do mincount=1, I only get facet field values that are actually in the documents:

'facet_fields': {
  'title': [
    'build', 1,
    'central', 1,
    :
    :

for

http://localhost:8085/solr/select?q=example.com&wt=python&indent=on&facet=true&facet.field=title&facet.limit=-1&facet.sort=true&facet.mincount=1

assuming title is a facetable field. Please correct if I'm on the wrong track.

cheers
gene

On Mon, Aug 18, 2008 at 5:10 PM, Gene Campbell [EMAIL PROTECTED] wrote:

I'm still learning how to use facets with Solr correctly. It seems that you get facet counts computed over all docs in your index. For example, I tried this on a local index I've built up for testing. This index has urls uniquely indexed, so no two docs have the same url value.

http://localhost:8085/solr/select?q=www.example.com&facet=true&facet.field=title&facet.limit=-1&facet.mincount=1

This returns what seems like facet values for the title field over my whole index. What I would like is the facet value counts computed just over the docs returned. Is this possible?

thanks
gene
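The facet_fields response shown above interleaves terms and counts in one flat list ('build', 1, 'central', 1, ...). A small client-side helper to turn that into a dict:

```python
def facet_list_to_dict(flat):
    """Convert Solr's flat ['term1', count1, 'term2', count2, ...] facet
    list into a {term: count} dict."""
    return dict(zip(flat[0::2], flat[1::2]))

title_facets = ["build", 1, "central", 1]
print(facet_list_to_dict(title_facets))  # {'build': 1, 'central': 1}
```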