Re: IDF in Distributed Search
Global IDF does not require another request/response. It is nearly free if you return the right info.

Return the total number of docs and the df for each query term in the original response. Sum the doc counts and dfs across servers, recompute the idf, and re-rank. See this post for an efficient way to do it:

http://wunderwood.org/most_casual_observer/2007/04/progressive_reranking.html

This works best if you treat the results from each server as a queue and refill just that queue when it is exhausted. All the good results might be from one server.

wunder

On 4/11/08 8:50 PM, "Yonik Seeley" <[EMAIL PROTECTED]> wrote:

> On Fri, Apr 11, 2008 at 11:39 PM, Otis Gospodnetic
> <[EMAIL PROTECTED]> wrote:
>> So, I'd like to see what it would take to add distributed IDF info to Solr's
>> distributed search.
>> Here are some questions to get the discussion going:
>> - Is anyone already working on it?
>> - Does anyone plan on working on it in the very near future?
>> - Does anyone already have thoughts how and where dist. idf could be plugged
>> in?
>> - There is a mention of dist idf and performance cost up there - any idea
>> how costly dist idf would
>
> It's relatively easy to implement, but the performance cost is not
> negligible since it adds another search "phase" (another
> request-response). It should be optional of course (globalidf=true),
> so there is no reason not to add this feature.
>
> I also left room for this stage (ResponseBuilder.STAGE_PARSE_QUERY),
> which is ordered before query execution.
>
> -Yonik
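The summing step wunder describes can be sketched in a few lines of Java. This is only an illustration under stated assumptions: the ShardStats and ShardHit classes, the single-term query, and the simplified sqrt(tf) * idf score are all hypothetical and are not Solr's distributed search code. The idf formula is the classic Lucene-style 1 + ln(numDocs / (docFreq + 1)), assumed here for concreteness.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical per-shard statistics returned with the shard's hits. */
final class ShardStats {
    final long numDocs;  // total documents in this shard
    final long docFreq;  // documents in this shard that contain the query term

    ShardStats(long numDocs, long docFreq) {
        this.numDocs = numDocs;
        this.docFreq = docFreq;
    }
}

/** Hypothetical hit: a document id plus the raw term frequency in that document. */
final class ShardHit {
    final String id;
    final int tf;

    ShardHit(String id, int tf) {
        this.id = id;
        this.tf = tf;
    }
}

final class GlobalIdfSketch {
    /** Lucene-style idf (an assumption of this sketch): 1 + ln(numDocs / (docFreq + 1)). */
    static double idf(long numDocs, long docFreq) {
        return 1.0 + Math.log((double) numDocs / (double) (docFreq + 1));
    }

    /** Sums doc counts and dfs over all shards, then computes one idf from the totals. */
    static double globalIdf(List<ShardStats> shards) {
        long numDocs = 0;
        long docFreq = 0;
        for (ShardStats s : shards) {
            numDocs += s.numDocs;
            docFreq += s.docFreq;
        }
        return idf(numDocs, docFreq);
    }

    /** Simplified single-term score: sqrt(tf) * idf. */
    static double score(ShardHit hit, double idf) {
        return Math.sqrt(hit.tf) * idf;
    }

    /** Returns a copy of the merged hits, ordered best-first by the recomputed score. */
    static List<ShardHit> rerank(List<ShardHit> merged, double idf) {
        List<ShardHit> out = new ArrayList<>(merged);
        out.sort(Comparator.comparingDouble((ShardHit h) -> score(h, idf)).reversed());
        return out;
    }
}
```

With two equally sized shards where the term is rare in one and common in the other, the global idf falls between the two local values, which is exactly the skew a badly mixed set of shards would otherwise bake into the ranking.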
Re: IDF in Distributed Search
On Fri, Apr 11, 2008 at 11:39 PM, Otis Gospodnetic <[EMAIL PROTECTED]> wrote:
> So, I'd like to see what it would take to add distributed IDF info to Solr's
> distributed search.
> Here are some questions to get the discussion going:
> - Is anyone already working on it?
> - Does anyone plan on working on it in the very near future?
> - Does anyone already have thoughts how and where dist. idf could be plugged
> in?
> - There is a mention of dist idf and performance cost up there - any idea
> how costly dist idf would

It's relatively easy to implement, but the performance cost is not negligible since it adds another search "phase" (another request-response). It should be optional of course (globalidf=true), so there is no reason not to add this feature.

I also left room for this stage (ResponseBuilder.STAGE_PARSE_QUERY), which is ordered before query execution.

-Yonik
IDF in Distributed Search
Hi,

With a well-mixed distributed set of indices, not having distributed/global IDF won't hurt much. But what if one has a not-so-well-mixed set of shards? One might want to apply rules when assigning documents to shards in order to group certain types of documents into only a subset of all shards instead of having them spread across all shards. Doing such careful sharding might allow the searcher to be smarter about which shards to search based on the query, the client running the query, etc.

Thus, I've run through comments on SOLR-303 to see what has been said about distributed IDF. Here is what I extracted:

"## I'm not quite sure about GlobalCollectionStat. Is its purpose just to normalize weights from the shards?"

"It's to make a distributed search score the same as it would if everything was in a single index. idf (inverse document frequency) is part of the scoring, so that component essentially does a distributed idf."

"...distributed idf... this has a performance cost, and should matter little in a well mixed index."

So, I'd like to see what it would take to add distributed IDF info to Solr's distributed search.
Here are some questions to get the discussion going:
- Is anyone already working on it?
- Does anyone plan on working on it in the very near future?
- Does anyone already have thoughts how and where dist. idf could be plugged in?
- There is a mention of dist idf and performance cost up there - any idea how costly dist idf would be?

Thanks,
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
[jira] Updated: (SOLR-486) Support binary formats for QueryresponseWriter
     [ https://issues.apache.org/jira/browse/SOLR-486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yonik Seeley updated SOLR-486:
------------------------------

    Attachment: SOLR-486.patch

Revised patch that switches distributed search to use the binary format. Currently fails the distributed search tests though.

> Support binary formats for QueryresponseWriter
> ----------------------------------------------
>
>                 Key: SOLR-486
>                 URL: https://issues.apache.org/jira/browse/SOLR-486
>             Project: Solr
>          Issue Type: Improvement
>          Components: clients - java, search
>            Reporter: Noble Paul
>            Priority: Minor
>             Fix For: 1.3
>
>         Attachments: SOLR-486.patch, SOLR-486.patch, SOLR-486.patch,
> SOLR-486.patch, SOLR-486.patch
>
>
> QueryResponse writer only allows text data to be written.
> So it is not possible to implement a binary protocol. Create another
> interface which has a method
> write(OutputStream os, SolrQueryRequest request, SolrQueryResponse response)

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Assigned: (SOLR-516) Add hl.alternateFieldLen parameter, to set max length for hl.alternateField
     [ https://issues.apache.org/jira/browse/SOLR-516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mike Klaas reassigned SOLR-516:
-------------------------------

    Assignee: Mike Klaas

> Add hl.alternateFieldLen parameter, to set max length for hl.alternateField
> ---------------------------------------------------------------------------
>
>                 Key: SOLR-516
>                 URL: https://issues.apache.org/jira/browse/SOLR-516
>             Project: Solr
>          Issue Type: Improvement
>          Components: highlighter
>            Reporter: Koji Sekiguchi
>            Assignee: Mike Klaas
>            Priority: Trivial
>         Attachments: SOLR-516-solr-ruby.patch, SOLR-516.patch, SOLR-516.patch
>
>
> USE CASE:
> You have a document that is composed of (short) title and (long) body fields
> and want body to be highlighted.
> In order to avoid highlighted body field to be empty, you can use
> hl.alternateField parameter.
> Although you want to set f.body.hl.alternateField=body, you may set
> f.body.hl.alternateField=title,
> because response time is awful when the body values are big. But the title
> field provides users with
> information smaller than body field.
> In this case, you can use f.body.hl.alternateFieldLen=100 to limit the body
> length to 100 characters.
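For readers who want to see the use case as a request, here is a small sketch that assembles the highlighting parameters named in the issue into a query string. The AlternateFieldParams class and its methods are hypothetical helpers, not part of Solr or solrj, and the parameter name hl.alternateFieldLen is taken from this issue's title; the SOLR-537 messages in this archive refer to it as hl.maxAlternateFieldLength, so check which name your Solr version accepts.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper that builds the highlighting parameters from the use case. */
final class AlternateFieldParams {
    /** Highlights `field`, falling back to at most `maxLen` characters of the same field. */
    static Map<String, String> highlightParams(String field, int maxLen) {
        Map<String, String> params = new LinkedHashMap<>();  // keeps insertion order
        params.put("hl", "true");
        params.put("hl.fl", field);
        params.put("f." + field + ".hl.alternateField", field);
        // Parameter name as written in SOLR-516; treat it as an assumption.
        params.put("f." + field + ".hl.alternateFieldLen", Integer.toString(maxLen));
        return params;
    }

    /** Joins the parameters into a URL-encoded query string. */
    static String toQueryString(Map<String, String> params) {
        StringBuilder sb = new StringBuilder();
        try {
            for (Map.Entry<String, String> e : params.entrySet()) {
                if (sb.length() > 0) {
                    sb.append('&');
                }
                sb.append(URLEncoder.encode(e.getKey(), "UTF-8"))
                  .append('=')
                  .append(URLEncoder.encode(e.getValue(), "UTF-8"));
            }
        } catch (UnsupportedEncodingException ex) {
            // UTF-8 is always available on a conforming JVM.
            throw new IllegalStateException(ex);
        }
        return sb.toString();
    }
}
```

Calling toQueryString(highlightParams("body", 100)) yields the f.body.hl.alternateField=body and f.body.hl.alternateFieldLen=100 pair from the description, ready to append to a select URL.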
[jira] Resolved: (SOLR-516) Add hl.alternateFieldLen parameter, to set max length for hl.alternateField
     [ https://issues.apache.org/jira/browse/SOLR-516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mike Klaas resolved SOLR-516.
-----------------------------

    Resolution: Fixed

> Add hl.alternateFieldLen parameter, to set max length for hl.alternateField
> ---------------------------------------------------------------------------
>
>                 Key: SOLR-516
>                 URL: https://issues.apache.org/jira/browse/SOLR-516
>             Project: Solr
>          Issue Type: Improvement
>          Components: highlighter
>            Reporter: Koji Sekiguchi
>            Assignee: Mike Klaas
>            Priority: Trivial
>         Attachments: SOLR-516-solr-ruby.patch, SOLR-516.patch, SOLR-516.patch
>
>
> USE CASE:
> You have a document that is composed of (short) title and (long) body fields
> and want body to be highlighted.
> In order to avoid highlighted body field to be empty, you can use
> hl.alternateField parameter.
> Although you want to set f.body.hl.alternateField=body, you may set
> f.body.hl.alternateField=title,
> because response time is awful when the body values are big. But the title
> field provides users with
> information smaller than body field.
> In this case, you can use f.body.hl.alternateFieldLen=100 to limit the body
> length to 100 characters.
[jira] Updated: (SOLR-330) Use new Lucene Token APIs (reuse and char[] buff)
     [ https://issues.apache.org/jira/browse/SOLR-330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yonik Seeley updated SOLR-330:
------------------------------

    Attachment: token_filter.patch

Attaching token_filter.patch, minor update to synonym and WFD to prevent extra token creation.

> Use new Lucene Token APIs (reuse and char[] buff)
> -------------------------------------------------
>
>                 Key: SOLR-330
>                 URL: https://issues.apache.org/jira/browse/SOLR-330
>             Project: Solr
>          Issue Type: Improvement
>            Reporter: Yonik Seeley
>            Assignee: Grant Ingersoll
>            Priority: Minor
>         Attachments: SOLR-330.patch, SOLR-330.patch, token_filter.patch
>
>
> Lucene is getting new Token APIs for better performance.
> - token reuse
> - char[] offset + len instead of String
> Requires a new version of lucene.
Re: Automatic binding of results to Beans (for solrj)
honestly have not looked at it in ages ;)

make a patch and i'll check it over. I imagine it is pretty good...

ryan

On Apr 10, 2008, at 12:07 AM, Noble Paul നോബിള് नोब्ळ् wrote:

> hi Ryan,
> I can raise an issue and provide a patch.
> Is the proposed API fine, or do you wish it to be altered?
> --Noble
>
> On Thu, Apr 10, 2008 at 3:03 AM, Ryan McKinley <[EMAIL PROTECTED]> wrote:
>> yes, in an early version of solrj, I had an annotation -> SolrDocument
>> implementation. It also had a hibernate connection inspired by compass
>> (http://www.compass-project.org/) -- it got tossed in an effort to
>> simplify what got committed.
>>
>> check:
>> http://solrstuff.org/svn/solrj-hibernate/
>> for an OLD version that won't compile with the current version, but may
>> be a good place to look
>>
>> ryan
>>
>> On Apr 9, 2008, at 2:49 PM, Noble Paul നോബിള് नोब्ळ् wrote:
>>> We can use annotations to bind SolrDocument to java beans directly.
>>> This can make the usage a bit simpler.
>>> The QueryResponse class in solrj can have an extra method as follows
>>>
>>>     public <T> List<T> getResultBeans(Class<T> klass)
>>>
>>> and the bean can have annotations as
>>>
>>>     class MyBean {
>>>         @Field("id") // name is optional
>>>         String id;
>>>
>>>         @Field("category")
>>>         List<String> categories;
>>>     }
>>>
>>> --
>>> --Noble Paul
>
> --
> --Noble Paul
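To make the proposal in this thread concrete, here is a minimal, self-contained sketch of annotation-driven binding using plain reflection. Everything in it is an assumption for illustration: the @Field annotation is declared locally, a Map<String, Object> stands in for SolrDocument, and BeanBinder.getResultBeans is a hypothetical static helper rather than a method on solrj's QueryResponse.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Constructor;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Local stand-in for the proposed annotation; an empty value means "use the Java field name". */
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.FIELD)
@interface Field {
    String value() default "";
}

/** Example bean, following the shape given in the thread. */
class MyBean {
    @Field("id")
    String id;

    @Field("category")
    List<String> categories;
}

final class BeanBinder {
    /** Creates one bean per document and copies each annotated field from the document map. */
    static <T> List<T> getResultBeans(Class<T> klass, List<Map<String, Object>> docs) {
        List<T> beans = new ArrayList<>();
        try {
            Constructor<T> ctor = klass.getDeclaredConstructor();
            ctor.setAccessible(true);
            for (Map<String, Object> doc : docs) {
                T bean = ctor.newInstance();
                for (java.lang.reflect.Field f : klass.getDeclaredFields()) {
                    Field ann = f.getAnnotation(Field.class);
                    if (ann == null) {
                        continue;  // unannotated fields are left alone
                    }
                    String name = ann.value().isEmpty() ? f.getName() : ann.value();
                    if (doc.containsKey(name)) {
                        f.setAccessible(true);
                        f.set(bean, doc.get(name));  // no type conversion in this sketch
                    }
                }
                beans.add(bean);
            }
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot bind " + klass.getName(), e);
        }
        return beans;
    }
}
```

A real implementation would also need type conversion (for example, single values into lists) and probably setter support; the sketch only copies values whose runtime type already matches the bean field.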
Re: [jira] Commented: (SOLR-516) Add hl.alternateFieldLen parameter, to set max length for hl.alternateField
I opened SOLR-537 for solr-ruby as Hoss suggested.

Thank you,

Koji

Chris Hostetter wrote:
> : I have zero familiarity with the ruby side of Solr, so I will leave the
> : issue open for the ruby client patch to be reviewed and applied.
>
> since client work and server work are parallel but distinct, I would
> suggest cloning the issue so the ruby work can be tracked separately. it
> makes the issue statuses and CHANGES.txt more reflective of reality.
>
> -Hoss
[jira] Updated: (SOLR-537) Use hl.maxAlternateFieldLength parameter from solr-ruby
     [ https://issues.apache.org/jira/browse/SOLR-537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Koji Sekiguchi updated SOLR-537:
--------------------------------

    Attachment: SOLR-537.patch

a patch to use hl.maxAlternateFieldLength parameter from solr-ruby

> Use hl.maxAlternateFieldLength parameter from solr-ruby
> -------------------------------------------------------
>
>                 Key: SOLR-537
>                 URL: https://issues.apache.org/jira/browse/SOLR-537
>             Project: Solr
>          Issue Type: Improvement
>          Components: highlighter
>    Affects Versions: 1.3
>            Reporter: Koji Sekiguchi
>            Priority: Trivial
>         Attachments: SOLR-537.patch
>
>
[jira] Created: (SOLR-537) Use hl.maxAlternateFieldLength parameter from solr-ruby
Use hl.maxAlternateFieldLength parameter from solr-ruby
-------------------------------------------------------

                 Key: SOLR-537
                 URL: https://issues.apache.org/jira/browse/SOLR-537
             Project: Solr
          Issue Type: Improvement
          Components: highlighter
    Affects Versions: 1.3
            Reporter: Koji Sekiguchi
            Priority: Trivial
         Attachments: SOLR-537.patch