[jira] Commented: (SOLR-303) Distributed Search over HTTP

patrick o'leary (JIRA) Mon, 11 Feb 2008 20:48:32 -0800

    [ 
https://issues.apache.org/jira/browse/SOLR-303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12567953#action_12567953
 ]


patrick o'leary commented on SOLR-303:
--------------------------------------

It looks pretty good. I really need the ShardDoc classes to be split out into 
public classes so I can use them. 
It would also be fantastic to open up QueryComponent: my component only needs 
to override a few functions, and it would be so much cleaner to just extend 
QueryComponent rather than duplicate the code.

Also, during testing it might be worthwhile to cover a few negative edge 
cases, e.g. duplicate documents in different shards. As systems get larger 
this becomes increasingly likely. Only fixed hash indexing could ensure you 
don't get duplicates, but if you are trying to keep the environment extendable 
that might not be an option.
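
To make the trade-off concrete, here is a minimal sketch of what fixed hash 
indexing could look like. ShardRouter and shardFor are hypothetical names used 
only for illustration, and the String key type is an assumption; nothing here 
comes from the patch.

```java
/** Hypothetical helper: picks the shard a document is indexed into. */
final class ShardRouter {
    private ShardRouter() {}

    /** Maps a unique key to a shard index in the range [0, numShards). */
    static int shardFor(String uniqueKey, int numShards) {
        if (numShards <= 0) {
            throw new IllegalArgumentException("numShards must be positive");
        }
        // floorMod keeps the result non-negative even when hashCode() is negative.
        return Math.floorMod(uniqueKey.hashCode(), numShards);
    }

    public static void main(String[] args) {
        // The same key always maps to the same shard while numShards is fixed.
        System.out.println(shardFor("doc-42", 4));
    }
}
```

The same key always lands on the same shard as long as numShards stays fixed, 
which is what rules out duplicates. Adding a shard changes the mapping for many 
existing keys, which is why this may not suit an extendable setup.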

It took me a while to realize I had duplicated documents during indexing. They 
cause NPEs in the query response writers, so the problem is not obvious or easy 
to track down.

A solution would be to maintain a map of unique key values while adding the 
ShardDocs to the priority queue, and to continue past duplicates. You might 
also want some logic in there to ensure the same shard's doc is used for each 
duplicate, simply because the scores for identical docs will differ across 
shards, and the result could otherwise change depending on which shard 
responds first. This should eliminate that.


So something like this in QueryComponent.mergeIds:
{code}

Map<Object, String> uniqueDoc = new HashMap<Object, String>();

for (ShardResponse srsp : sreq.responses) {
  SolrDocumentList docs = srsp.rsp.getResults();
  ................
  ................
  // go through every doc in this response, construct a ShardDoc, and
  // put it in the priority queue so it can be ordered.
  for (int i=0; i<docs.size(); i++) {
    SolrDocument doc = docs.get(i);
    ..................
    ..................
    Object uniqueField = doc.getFieldValue(uniqueKeyField.getName());

    if (!uniqueDoc.containsKey(uniqueField)) {
      // first time this unique key is seen: remember which shard supplied it
      shardDoc.setId(uniqueField);
      uniqueDoc.put(uniqueField, shardDoc.shard);
    } else {
      // duplicate key: don't count it twice
      numFound--;
      // skip this copy if the shard recorded first sorts after this one
      if (uniqueDoc.get(uniqueField).compareTo(shardDoc.shard) > 0) {
        continue;
      }
    }

    ..........................
    queue.insert(shardDoc);
  } // end for-each-doc-in-response
} // end for-each-response
{code}
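
For anyone who wants to try the idea outside Solr, here is a self-contained 
sketch of the same de-duplication. It is not the patch code: DedupMerge, Hit 
and Merged are hypothetical stand-ins for the real classes, and it picks the 
surviving copy before filling the queue rather than while inserting. As an 
assumed tie-break rule, the copy from the shard whose name sorts lowest wins, 
so the result does not depend on which shard responds first.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

final class DedupMerge {
    /** Hypothetical stand-in for ShardDoc: unique key, source shard, score. */
    static final class Hit {
        final Object id;
        final String shard;
        final float score;

        Hit(Object id, String shard, float score) {
            this.id = id;
            this.shard = shard;
            this.score = score;
        }
    }

    /** De-duplicated hits ordered by descending score, plus the adjusted count. */
    static final class Merged {
        final List<Hit> hits;
        final long numFound;

        Merged(List<Hit> hits, long numFound) {
            this.hits = hits;
            this.numFound = numFound;
        }
    }

    static Merged merge(List<List<Hit>> shardResponses, long reportedNumFound) {
        Map<Object, Hit> chosen = new HashMap<>();
        long numFound = reportedNumFound;

        for (List<Hit> response : shardResponses) {
            for (Hit hit : response) {
                Hit previous = chosen.get(hit.id);
                if (previous == null) {
                    chosen.put(hit.id, hit); // first copy of this unique key
                    continue;
                }
                numFound--; // duplicate key: do not count it twice
                // Keep the copy from the lowest-sorting shard name.
                if (hit.shard.compareTo(previous.shard) < 0) {
                    chosen.put(hit.id, hit);
                }
            }
        }

        // Order the surviving hits by score, highest first.
        PriorityQueue<Hit> queue =
            new PriorityQueue<>((a, b) -> Float.compare(b.score, a.score));
        queue.addAll(chosen.values());

        List<Hit> ordered = new ArrayList<>();
        while (!queue.isEmpty()) {
            ordered.add(queue.poll());
        }
        return new Merged(ordered, numFound);
    }
}
```

Note that numFound is only corrected for duplicates that actually show up in 
the returned documents; duplicates further down the result lists would still 
be counted twice.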

> Distributed Search over HTTP
> ----------------------------
>
>                 Key: SOLR-303
>                 URL: https://issues.apache.org/jira/browse/SOLR-303
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>            Reporter: Sharad Agarwal
>            Assignee: Yonik Seeley
>         Attachments: distributed.patch, distributed.patch, distributed.patch, 
> distributed.patch, distributed.patch, distributed.patch, distributed.patch, 
> distributed.patch, distributed_pjaol.patch, fedsearch.patch, fedsearch.patch, 
> fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, 
> fedsearch.patch, fedsearch.stu.patch, fedsearch.stu.patch
>
>
> Searching over multiple shards and aggregating results.
> Motivated by http://wiki.apache.org/solr/DistributedSearch

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
