[ 
https://issues.apache.org/jira/browse/SOLR-12587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16585428#comment-16585428
 ] 

Varun Thacker commented on SOLR-12587:
--------------------------------------

When I took a deeper look at it today, there were still a few subtle things that 
weren't obvious to me:
 # The Solr PQ has a reset method which resets the size to maxSize and then 
does a System.arraycopy. If we were to use the Lucene PQ, we don't have a way 
to reset size to maxSize. Secondly, we would no longer do System.arraycopy and 
would instead reset the heap in a for loop, which is probably slower. Is that 
why it was done like this in the first place? A 25M export on the "id" field 
that used to take 7m15s now took 10.54s when I simulated this by not reusing 
the PQ and instead creating a new PQ for every 30k docs collected in 
ExportWriter (which was earlier using the reset):
{code:java}
protected void reset() {
  Object[] heap = getHeapArray();
  if(cache != null) {
    System.arraycopy(cache, 1, heap, 1, heap.length-1);
    size = maxSize;
  } else {
    populate();
  }
}{code}

 # We could perhaps do a "true" reset and even avoid doing a System.arraycopy 
if we never nulled the object we popped and relied on size to do the right 
thing. Then reset would simply call SortDoc#reset on each element and change 
size back to maxSize. We would save a lot of generated objects.
{code:java}
public final T pop() {
  if (size > 0) {
    T result = heap[1];       // save first value
    heap[1] = heap[size];     // move last to first
    heap[size] = null;        // permit GC of objects //<---------- remove this line
    size--;
    downHeap();               // adjust heap
    return result;
  } else {
    return null;
  }
}
// pseudo code for reset
protected void reset() {
  Object[] heap = getHeapArray();
  for (int i = 1; i < heap.length; i++) {
    ((SortDoc) heap[i]).reset();
  }
  size = maxSize;
}{code}
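
One detail of approach 2 is worth spelling out. If only the null assignment is removed from pop(), heap[size] keeps pointing at the element that was just moved to heap[1], and the popped element is no longer referenced from the array, so a later reset would not find it. A minimal sketch of a variant that parks the popped element in the vacated slot is below. The names ReusableHeap, Entry and key are hypothetical; Entry is only a stand-in for SortDoc, and this is not the Lucene or Solr implementation.

```java
/** Hypothetical sketch of a reusable min-heap; not Lucene's or Solr's PriorityQueue. */
final class ReusableHeap {
  /** Hypothetical stand-in for SortDoc: mutable and resettable to a sentinel. */
  static final class Entry {
    long key = Long.MIN_VALUE;
    void reset() { key = Long.MIN_VALUE; }
  }

  private final Entry[] heap; // 1-based; heap[0] is unused
  private final int maxSize;
  private int size;

  ReusableHeap(int maxSize) {
    this.maxSize = maxSize;
    heap = new Entry[maxSize + 1];
    for (int i = 1; i <= maxSize; i++) {
      heap[i] = new Entry(); // allocate sentinels once, up front
    }
    size = maxSize;
  }

  Entry top() { return size > 0 ? heap[1] : null; }

  /** Call after mutating top().key to restore heap order. */
  void updateTop() { downHeap(); }

  Entry pop() {
    if (size == 0) return null;
    Entry result = heap[1];
    heap[1] = heap[size];
    heap[size] = result; // park the popped element instead of nulling the slot
    size--;
    downHeap();
    return result;
  }

  /** Reuses every allocated Entry: no arraycopy and no new objects. */
  void reset() {
    for (int i = 1; i <= maxSize; i++) {
      heap[i].reset();
    }
    size = maxSize; // all keys are equal, so any order is a valid heap
  }

  private void downHeap() {
    if (size == 0) return;
    int i = 1;
    Entry node = heap[i];
    int j = i << 1;
    int k = j + 1;
    if (k <= size && heap[k].key < heap[j].key) j = k;
    while (j <= size && heap[j].key < node.key) {
      heap[i] = heap[j]; // shift the smaller child up
      i = j;
      j = i << 1;
      k = j + 1;
      if (k <= size && heap[k].key < heap[j].key) j = k;
    }
    heap[i] = node;
  }
}
```

Because every slot always holds exactly one of the original objects, reset only has to walk the array and restore size. The trade-off is that popped elements stay reachable until the queue itself is dropped.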
 

In approach 1, we'd essentially be giving up whatever optimizations 
System.arraycopy provides (being a native call) and relying on a for loop 
instead. In approach 2, we'd basically be creating some sort of reusable PQ.
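
To make the approach 1 trade-off concrete, here is a minimal sketch of the two ways of restoring a 1-based heap array from a cached snapshot. The class and method names are hypothetical, and the loop variant is only an assumption about what an element-wise reset would look like; it is not taken from the patch.

```java
/** Hypothetical helpers contrasting a bulk copy with an element-wise loop. */
final class ResetStrategies {
  private ResetStrategies() {}

  /** Bulk restore of slots 1..length-1, as the Solr reset() does today. */
  static void resetWithArrayCopy(Object[] cache, Object[] heap) {
    System.arraycopy(cache, 1, heap, 1, heap.length - 1);
  }

  /** Element-wise restore; produces the same contents, one slot at a time. */
  static void resetWithLoop(Object[] cache, Object[] heap) {
    for (int i = 1; i < heap.length; i++) {
      heap[i] = cache[i];
    }
  }
}
```

Both methods leave the heap in the same state, so any difference is purely in cost. Measuring that difference reliably would need a proper benchmark harness rather than a one-off timing.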



Thoughts ?

> Reuse Lucene's PriorityQueue for the ExportHandler
> --------------------------------------------------
>
>                 Key: SOLR-12587
>                 URL: https://issues.apache.org/jira/browse/SOLR-12587
>             Project: Solr
>          Issue Type: Improvement
>      Security Level: Public(Default Security Level. Issues are Public) 
>            Reporter: Varun Thacker
>            Assignee: Varun Thacker
>            Priority: Major
>              Labels: export-writer
>         Attachments: SOLR-12587.patch, SOLR-12587.patch
>
>
> We have a priority queue in Lucene  {{org.apache.lucene.util.PriorityQueue}} . 
> The Export Handler also implements a PriorityQueue 
> {{org.apache.solr.handler.export.PriorityQueue}} . Both are obviously very 
> similar with minor API differences. 
>  
> The aim here is to reuse Lucene's PQ and remove the Solr implementation. 
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
