Re: facet results in order of rank
Thanks for the reply. Hopefully I'll get more, and turn this into a mini project I can commit back to the project, or at least make available to anyone who'd like the functionality. Of course, if I'm the only one who cares, it could be a long road. :)

gene

On Fri, May 1, 2009 at 9:41 AM, Ensdorf Ken wrote:
>> Hello Solrites (or Solrorians)
>
> I prefer "Solrdier" :)
>
>> Is it possible to get the average ranking score for a set of docs that
>> would be returned for a given facet value.
>>
>> If not in SOLR, what about Lucene?
>>
>> How hard to implement?
>>
>> I have years of Java experience, but no Lucene coding experience.
>>
>> Would be happy to implement if someone could guide me.
>>
>> thanks
>> Gene
>
> I don't know much about the implementation, but it seems to me it should
> be possible to sum up the scores as the matching facet terms are gathered
> and counted. According to the docs there are 2 algorithms that do this -
> one enumerates all the unique values of the facet field and does an
> intersection with the query, and the other scans the result set and sums
> up the unique values in the facet field for each doc. I would start by
> looking at the source for the FacetComponent
> (org.apache.solr.handler.component) and SimpleFacets
> (org.apache.solr.request) classes.
>
> Sorry I can't be of more help - it seems like an interesting challenge!
>
> Onward...
> -Ken
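The gist of Ken's suggestion is to keep a score sum alongside each facet count and sort facet values by sum/count. A deliberately generic Java sketch of that bookkeeping, applied to an already-scored result set; the ScoredDoc and FacetStats types below are made-up stand-ins, not real Solr classes:

import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/**
 * Illustration only: group an already-scored result set by facet value,
 * keep a count and a score sum per value, and rank values by sum/count.
 */
public class ScoreRankedFacets {

    static class ScoredDoc {
        final float score;        // relevancy score from the original query
        final String facetValue;  // e.g. the AO (animal-origin) term of this doc
        ScoredDoc(float score, String facetValue) {
            this.score = score;
            this.facetValue = facetValue;
        }
    }

    static class FacetStats {
        int count;
        double scoreSum;
        double avg() { return count == 0 ? 0.0 : scoreSum / count; }
    }

    static Map<String, FacetStats> collect(List<ScoredDoc> results) {
        Map<String, FacetStats> byValue = new HashMap<String, FacetStats>();
        for (ScoredDoc doc : results) {
            FacetStats stats = byValue.get(doc.facetValue);
            if (stats == null) {
                stats = new FacetStats();
                byValue.put(doc.facetValue, stats);
            }
            stats.count++;               // the usual facet count
            stats.scoreSum += doc.score; // the extra bookkeeping Ken suggests
        }
        return byValue;
    }

    public static void main(String[] args) {
        List<ScoredDoc> results = new ArrayList<ScoredDoc>();
        results.add(new ScoredDoc(4.2f, "Norway"));
        results.add(new ScoredDoc(0.3f, "Norway"));
        results.add(new ScoredDoc(3.9f, "Romania"));

        List<Map.Entry<String, FacetStats>> ranked =
                new ArrayList<Map.Entry<String, FacetStats>>(collect(results).entrySet());
        Collections.sort(ranked, new Comparator<Map.Entry<String, FacetStats>>() {
            public int compare(Map.Entry<String, FacetStats> a, Map.Entry<String, FacetStats> b) {
                return Double.compare(b.getValue().avg(), a.getValue().avg());
            }
        });
        for (Map.Entry<String, FacetStats> e : ranked) {
            System.out.println(e.getKey() + ": count=" + e.getValue().count
                    + " avgScore=" + e.getValue().avg());
        }
    }
}

Doing this inside Solr itself would mean wiring the same accumulation into the facet-counting paths that Ken points at (FacetComponent / SimpleFacets); the sketch above only shows the arithmetic, not where it would live.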
Re: facet results in order of rank
Hello Solrites (or Solrorians)

Is it possible to get the average ranking score for a set of docs that would be returned for a given facet value.

If not in SOLR, what about Lucene?

How hard to implement?

I have years of Java experience, but no Lucene coding experience.

Would be happy to implement if someone could guide me.

thanks
Gene

On Tue, Apr 28, 2009 at 11:39 AM, Gene Campbell wrote:
> Thanks for the reply
>
> Your thoughts are what I initially was thinking. But, given some more
> consideration, I imagined a system that would take all the docs that
> would be returned for a given facet, and get an average score based on
> their scores from the original search that produced the facets. This
> would be the facet value's rank. So, a higher ranked facet value would
> be more likely to return higher ranked results.
>
> The idea is that if you want a broad, loose search over a large
> dataset, you order the results based on rank so you get the most
> relevant results at the top, e.g. the first page in a search engine
> website. You might have pages and pages of results, but it's the
> first few pages of highly ranked results that most users generally
> see. As the relevance tapers off, users generally do another search.
>
> However, if you compute facet values on these results, you have no way
> of knowing if one facet value for a field is more or less likely to
> return higher scored, relevant records for the user. You end up
> getting facet values that match records that are often totally
> irrelevant.
>
> We can sort by index order, or count of docs returned. What I would
> like is a sort based on score, such that it would be
> sum(scores)/count.
>
> I would assume that most users would be interested in the higher
> ranked ones more often. So, a more efficient UI could be built to
> show just the high ranked facets on this score, and provide a control
> to show all the facets (not just the high ranked ones.)
>
> Does this clear up my post at all?
>
> Perhaps this wouldn't be too hard for me to implement. I have lots of
> Java experience, but no experience with Lucene or Solr code.
> Thoughts?
>
> thanks
> gene
>
> On Tue, Apr 28, 2009 at 10:56 AM, Shalin Shekhar Mangar wrote:
>> On Fri, Apr 24, 2009 at 12:25 PM, ristretto.rb wrote:
>>
>>> Hello,
>>>
>>> Is it possible to order the facet results on some ranking score?
>>> I've had a look at the facet.sort param,
>>> (http://wiki.apache.org/solr/SimpleFacetParameters#head-569f93fb24ec41b061e37c702203c99d8853d5f1)
>>> but that seems to order the facet either by count or by index value
>>> (in my case alphabetical.)
>>
>> Facets are not ranked because there is no criterion for determining
>> relevancy for them. They are just the count of documents for each term
>> in a given field computed for the current result set.
>>
>>> We are facing a big number of facet results for multiple-term
>>> queries that are OR'ed together. We want to keep the OR nature of our
>>> queries, but we want to know which facet values are likely to give
>>> you higher ranked results. We could AND together the terms to get the
>>> facet list to be more manageable, but we would be filtering out too
>>> many results. We prefer to OR terms and let the ranking bring the
>>> good stuff to the top.
>>>
>>> For example, suppose we have an index of all known animals and
>>> each doc has a field AO for animal-origin.
>>>
>>> Suppose we search for: wolf grey forest Europe
>>> And generate facets on AO. We might get the following facet results:
>>>
>>> For the AO field, lots of countries of the world probably have grey
>>> or forest or wolf or Europe in their indexing data, so I'm asserting
>>> we'd get a big list here.
>>> But, only some of the countries will have all 4 terms, and those are
>>> the facets that will be the most interesting to drill down on. Is
>>> there a way to figure out which facet is the most highly ranked like
>>> this?
>>
>> Suppose 10 documents match the query you described. If you facet on
>> AO, then it would just go through all the terms in AO and give you the
>> number of documents which have that term. There's no question of
>> relevance at all here. The returned documents themselves are of course
>> ranked according to the relevancy score.
>>
>> Perhaps I've misunderstood the query?
>>
>> --
>> Regards,
>> Shalin Shekhar Mangar.
Re: facet results in order of rank
BUMP. After waiting a bit for a comment on this, I'm assuming there's no support for this type of feature. So, we are pushing on with a completely different implementation. Unfortunately, we haven't the time or the expertise to consider implementing it ourselves.

gene

On Fri, Apr 24, 2009 at 6:55 PM, ristretto.rb wrote:
> Hello,
>
> Is it possible to order the facet results on some ranking score?
> I've had a look at the facet.sort param,
> (http://wiki.apache.org/solr/SimpleFacetParameters#head-569f93fb24ec41b061e37c702203c99d8853d5f1)
> but that seems to order the facet either by count or by index value
> (in my case alphabetical.)
>
> We are facing a big number of facet results for multiple-term
> queries that are OR'ed together. We want to keep the OR nature of our
> queries, but we want to know which facet values are likely to give you
> higher ranked results. We could AND together the terms to get the
> facet list to be more manageable, but we would be filtering out too
> many results. We prefer to OR terms and let the ranking bring the good
> stuff to the top.
>
> For example, suppose we have an index of all known animals and
> each doc has a field AO for animal-origin.
>
> Suppose we search for: wolf grey forest Europe
> And generate facets on AO. We might get the following facet results:
>
> For the AO field, lots of countries of the world probably have grey or
> forest or wolf or Europe in their indexing data, so I'm asserting we'd
> get a big list here.
> But, only some of the countries will have all 4 terms, and those are
> the facets that will be the most interesting to drill down on. Is
> there a way to figure out which facet is the most highly ranked like
> this?
>
> This is a contrived example, not part of any real project I know
> about. Just trying to get my point across.
>
> thanks
> Gene
>
> Gene Campbell
> Picante Solutions Limited
facet results in order of rank
Hello,

Is it possible to order the facet results on some ranking score? I've had a look at the facet.sort param (http://wiki.apache.org/solr/SimpleFacetParameters#head-569f93fb24ec41b061e37c702203c99d8853d5f1), but that seems to order the facet either by count or by index value (in my case alphabetical.)

We are facing a big number of facet results for multiple-term queries that are OR'ed together. We want to keep the OR nature of our queries, but we want to know which facet values are likely to give you higher ranked results. We could AND together the terms to get the facet list to be more manageable, but we would be filtering out too many results. We prefer to OR terms and let the ranking bring the good stuff to the top.

For example, suppose we have an index of all known animals and each doc has a field AO for animal-origin.

Suppose we search for: wolf grey forest Europe
And generate facets on AO. We might get the following facet results:

For the AO field, lots of countries of the world probably have grey or forest or wolf or Europe in their indexing data, so I'm asserting we'd get a big list here. But, only some of the countries will have all 4 terms, and those are the facets that will be the most interesting to drill down on. Is there a way to figure out which facet is the most highly ranked like this?

This is a contrived example, not part of any real project I know about. Just trying to get my point across.

thanks
Gene

Gene Campbell
Picante Solutions Limited
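One way to approximate the ranking asked for here without patching Solr is to re-run the query once per candidate facet value with an fq on that value (fq restricts the result set but does not change the scores) and average the scores of the top hits. A rough SolrJ sketch of that idea; the AO field, the candidate values, and the Solr URL are just the thread's contrived example, and the original poster actually uses solr.py rather than SolrJ:

import java.util.LinkedHashMap;
import java.util.Map;

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.SolrDocumentList;

/**
 * Rough client-side estimate: for each candidate facet value, re-run the
 * original query filtered to that value and average the scores of the top
 * hits. Field name, values and URL are illustrative only.
 */
public class FacetValueScoreEstimate {

    public static void main(String[] args) throws Exception {
        SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
        String userQuery = "wolf grey forest Europe";
        String[] candidates = {"Norway", "Romania", "New Zealand"};

        Map<String, Double> avgScore = new LinkedHashMap<String, Double>();
        for (String value : candidates) {
            SolrQuery q = new SolrQuery(userQuery);
            q.addFilterQuery("AO:\"" + value + "\"");
            q.setFields("id", "score");  // ask for the score pseudo-field
            q.setRows(20);               // only the top hits matter for the estimate

            QueryResponse rsp = server.query(q);
            SolrDocumentList docs = rsp.getResults();

            double sum = 0.0;
            for (SolrDocument doc : docs) {
                sum += ((Number) doc.getFieldValue("score")).doubleValue();
            }
            avgScore.put(value, docs.isEmpty() ? 0.0 : sum / docs.size());
        }
        System.out.println(avgScore);  // higher average = facet value worth surfacing first
    }
}

This costs one extra query per facet value you want to rank, so it is only practical for a shortlist of candidates, not for every value the facet returns.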
Re: Seattle / PNW Hadoop + Lucene User Group?
Beer, eh? I'm in New Zealand, so I probably can't make it, but it sounds tempting.

cheers
gene

On Tue, Apr 21, 2009 at 11:28 AM, Bradford Stephens wrote:
> Thanks for the responses, everyone. Where shall we host? My company
> can offer space in our building in Factoria, but it's not exactly a
> 'cool' or 'fun' place. I can also reserve a room at a local library. I
> can bring some beer and light refreshments.
>
> On Mon, Apr 20, 2009 at 7:22 AM, Matthew Hall wrote:
>> Same here, sadly there isn't much call for Lucene user groups in
>> Maine. It would be nice though ^^
>>
>> Matt
>>
>> Amin Mohammed-Coleman wrote:
>>> I would love to come but I'm afraid I'm stuck in rainy old England :(
>>>
>>> Amin
>>>
>>> On 18 Apr 2009, at 01:08, Bradford Stephens wrote:
>>>> OK, we've got 3 people... that's enough for a party? :)
>>>>
>>>> Surely there must be dozens more of you guys out there... c'mon,
>>>> accelerate your knowledge! Join us in Seattle!
>>>>
>>>> On Thu, Apr 16, 2009 at 3:27 PM, Bradford Stephens wrote:
>>>>> Greetings,
>>>>>
>>>>> Would anybody be willing to join a PNW Hadoop and/or Lucene User
>>>>> Group with me in the Seattle area? I can donate some facilities,
>>>>> etc. -- I also always have topics to speak about :)
>>>>>
>>>>> Cheers,
>>>>> Bradford
Fwd: Advice on moving from 1.3 to 1.4-dev or trunk?
I have built the trunk code as of Revision: 765826 and tried !tag=/!ex=, which is what I need to work. And IT WORKS! That's great.

Now, is it unwise to release 1.4 into production for this feature (based on my explanation below)?

thanks
gene

---------- Forwarded message ----------
From: ristretto.rb
Date: Fri, Apr 17, 2009 at 11:16 AM
Subject: Advice on moving from 1.3 to 1.4-dev or trunk?
To: solr-user@lucene.apache.org

Hello,

I'm using solr 1.3 with solr.py. We have a basic schema.xml, nothing custom or out of the ordinary. I need the following feature from http://svn.apache.org/repos/asf/lucene/solr/trunk/CHANGES.txt

SOLR-911: Add support for multi-select faceting by allowing filters to be tagged and facet commands to exclude certain filters. This patch also added the ability to change the output key for facets in the response, and optimized distributed faceting refinement by lowering parsing overhead and by making requests and responses smaller.

Since this requires 1.4, it looks like I have to upgrade (or roll my own solution for what this feature provides.) I'm looking for a bit of advice. I have looked through the bugs here: http://issues.apache.org/jira/browse/SOLR/fixforversion/12313351

1. I would need to get the source for 1.4 and build it, right? No release yet, eh?
2. Anyone using 1.4 in production without issue; is this wise? Or should I wait?
3. Will I need to make changes to my schema.xml to support my current field set under 1.4?
4. Do I need to reindex all my data?

thanks
gene
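For reference, the SOLR-911 tagging/excluding is driven by local params on the filter query and on facet.field. A rough example of the request parameters, with a made-up tag name and the AO field from the other thread standing in for a real schema:

q=wolf grey forest Europe
fq={!tag=aoTag}AO:"New Zealand"
facet=true
facet.field={!ex=aoTag}AO

With this, the AO facet counts are computed as if the tagged AO filter were not applied, which is the multi-select behaviour the patch adds.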
Advice on moving from 1.3 to 1.4-dev or trunk?
Hello,

I'm using solr 1.3 with solr.py. We have a basic schema.xml, nothing custom or out of the ordinary. I need the following feature from http://svn.apache.org/repos/asf/lucene/solr/trunk/CHANGES.txt

SOLR-911: Add support for multi-select faceting by allowing filters to be tagged and facet commands to exclude certain filters. This patch also added the ability to change the output key for facets in the response, and optimized distributed faceting refinement by lowering parsing overhead and by making requests and responses smaller.

Since this requires 1.4, it looks like I have to upgrade (or roll my own solution for what this feature provides.) I'm looking for a bit of advice. I have looked through the bugs here: http://issues.apache.org/jira/browse/SOLR/fixforversion/12313351

1. I would need to get the source for 1.4 and build it, right? No release yet, eh?
2. Anyone using 1.4 in production without issue; is this wise? Or should I wait?
3. Will I need to make changes to my schema.xml to support my current field set under 1.4?
4. Do I need to reindex all my data?

thanks
gene
Anyone use solr admin and Opera?
Hello,

I'm a happy Solr user. Thanks for the excellent software!! Hopefully this is a good question; I have indeed looked around the FAQ and Google and such first.

I have just switched from Firefox to Opera for web browsing. (Another story.) When I use solr/admin, the home page and stats work fine, but searches return unformatted results all run together. If I view the source, I see it is XML, and in fact, the source is more readable than the page itself. Perhaps I need a stylesheet, or something. Are there any other Opera users that have gotten past this problem?

Thanks
gene
Re: unique result
FWIW... We run a hash of the content and other bits of our docs, and then remove duplicates according to specific algorithms. (Exactly the same page content can clearly be hosted on many different URLs and domains.) Then, the chosen ones are indexed. Though we toss the synonyms in the index too, so we know all its other "names."

cheers
gene

Gene Campbell
http://www.picante.co.nz
gene at picante point co point nz

http://www.travelbeen.com - "the social search engine for travel"

On Fri, Feb 27, 2009 at 5:53 AM, Cheng Zhang wrote:
> It's exactly what I'm looking for. Thank you Grant.
>
> ----- Original Message ----
> From: Grant Ingersoll
> To: solr-user@lucene.apache.org
> Sent: Thursday, February 26, 2009 6:56:22 AM
> Subject: Re: unique result
>
> I presume these all have different unique ids?
>
> If you can address it at indexing time, then have a look at
> https://issues.apache.org/jira/browse/SOLR-799
>
> Otherwise, you might look at
> https://issues.apache.org/jira/browse/SOLR-236
>
> On Feb 25, 2009, at 6:54 PM, Cheng Zhang wrote:
>
>> Is it possible to have Solr remove duplicated query results?
>>
>> For example, instead of returning
>>
>> Wireless
>> Wireless
>> Wireless
>> Video Games
>> Video Games
>>
>> return:
>>
>> Wireless
>> Video Games
>>
>> Thanks a lot,
>> Kevin
>
> --
> Grant Ingersoll
> http://www.lucidimagination.com/
>
> Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)
> using Solr/Lucene:
> http://www.lucidimagination.com/search
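As an illustration of the kind of pre-index de-duplication described above (hash the content, keep one copy), here is a small self-contained Java sketch; the CrawledPage type, the normalization, and the "first one wins" rule are placeholders rather than the actual algorithms Gene's pipeline uses:

import java.security.MessageDigest;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/**
 * Hash-based de-duplication before indexing: normalize the page body,
 * hash it, and keep only the first document seen for each hash.
 */
public class ContentDeduper {

    static class CrawledPage {
        final String url;
        final String body;
        CrawledPage(String url, String body) { this.url = url; this.body = body; }
    }

    static String contentHash(String body) throws Exception {
        // Crude normalization so trivial whitespace/case differences collapse together.
        String normalized = body.toLowerCase().replaceAll("\\s+", " ").trim();
        byte[] digest = MessageDigest.getInstance("MD5").digest(normalized.getBytes("UTF-8"));
        StringBuilder hex = new StringBuilder();
        for (byte b : digest) hex.append(String.format("%02x", b & 0xff));
        return hex.toString();
    }

    /** Map of content hash -> the single page that will actually be indexed. */
    static Map<String, CrawledPage> dedupe(List<CrawledPage> pages) throws Exception {
        Map<String, CrawledPage> byHash = new HashMap<String, CrawledPage>();
        for (CrawledPage page : pages) {
            String hash = contentHash(page.body);
            if (!byHash.containsKey(hash)) {
                byHash.put(hash, page);   // duplicates on other URLs/domains are dropped
            }
        }
        return byHash;
    }
}

Solr 1.4's SOLR-799 work does essentially this at update time on the server side; the sketch above is the client-side, do-it-yourself flavour described in the reply.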
Re: what crawler do you use for Solr indexing?
Hello,

I built my own crawler with Python, as I couldn't find (not complaining, probably didn't look hard enough) nutch documentation. I use BeautifulSoup, because the site is mostly based on Python/Django, and we like Python.

Writing one was good for us because we spent most of our time figuring out "what" to write ... how to fetch pages, which to choose, what data to store, etc. It was an awesome exercise that really narrowed the definition of our project. It helped us define our solr schema and other parts of the project during development. If we knew exactly what sort of data to crawl, and exactly what we intended to save, I'm sure we would have pushed harder at figuring out nutch. If I was to refactor, I would give Heritrix and Nutch good looks now.

cheers
gene

Gene Campbell
http://www.picante.co.nz
gene at picante point co point nz

http://www.travelbeen.com - "the social search engine for travel"

On Tue, Mar 10, 2009 at 11:14 PM, Andrzej Bialecki wrote:
> Sean Timm wrote:
>> We too use Heritrix. We tried Nutch first but Nutch was not finding
>> all of the documents that it was supposed to. When Nutch and Heritrix
>> were both set to crawl our own site to a depth of three, Nutch missed
>> some pages that were linked directly from the seed. We ended up with
>> 10%-20% fewer pages in the Nutch crawl.
>
> FWIW, from a private conversation with Sean it seems that this was
> likely related to the default configuration in Nutch, which collects
> only the first 1000 outlinks from a page. This is an arbitrary and
> configurable limit, introduced as a way to limit the impact of spam
> pages and to limit the size of LinkDb. If a page hits this limit then
> indeed the symptoms that you observe are missing (dropped) links.
>
> --
> Best regards,
> Andrzej Bialecki
> Information Retrieval, Semantic Web
> Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com