[jira] [Updated] (SOLR-12684) Document speed gotchas and partitionKeys usage for ParallelStream

Varun Thacker (JIRA) Mon, 20 Aug 2018 14:44:12 -0700


     [ https://issues.apache.org/jira/browse/SOLR-12684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Varun Thacker updated SOLR-12684:
---------------------------------
    Description: 
The aim of this Jira is to beef up the ref guide's coverage of the parallel stream.

There are two things I want to address:

 

First, the usage of partitionKeys:

This line in the ref guide suggests that the partitionKeys of a parallel stream 
should always be the same as the underlying sort criteria:
{code:java}
The parallel function maintains the sort order of the tuples returned by the 
worker nodes, so the sort criteria of the parallel function must match up with 
the sort order of the tuples returned by the workers.
{code}
But in the discussion on SOLR-12635, Joel provided a counterexample:
{code:java}
The hash partitioner just needs to send documents to the same worker node. You 
could do that with just one partitioning key

For example if you sort on year, month and day. You could partition on year 
only and still be fine as long as there was enough different years to spread 
the records around the worker nodes.{code}
So we should make this clearer in the ref guide.

Let's also document that, after SOLR-12683, specifying more than 4 partitionKeys 
will throw an error.
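
To make the partitioning rule concrete, here is a minimal Python sketch (not Solr code). It assumes a hypothetical worker_for helper that hashes only the partition key values, with zlib.crc32 standing in for whatever hash Solr actually uses, and the field names year, month and day are purely illustrative. It checks that every (year, month, day) group stays on a single worker when partitioning on year alone, and it mirrors the limit of 4 partitionKeys described for SOLR-12683.

```python
import zlib
from collections import Counter

MAX_PARTITION_KEYS = 4  # limit described for SOLR-12683


def worker_for(doc, partition_keys, workers):
    """Pick a worker by hashing only the partition key values.

    zlib.crc32 is a stand-in here; Solr's hash query parser uses its own hash.
    """
    if len(partition_keys) > MAX_PARTITION_KEYS:
        raise ValueError("at most %d partitionKeys are supported" % MAX_PARTITION_KEYS)
    key = "|".join(str(doc[k]) for k in partition_keys)
    return zlib.crc32(key.encode("utf-8")) % workers


def groups_stay_together(docs, partition_keys, group_fields, workers):
    """Return True when no group of documents is split across workers."""
    owner = {}
    for doc in docs:
        group = tuple(doc[f] for f in group_fields)
        worker = worker_for(doc, partition_keys, workers)
        if owner.setdefault(group, worker) != worker:
            return False
    return True


# Three documents for every (year, month, day) combination.
docs = [
    {"year": y, "month": m, "day": d}
    for y in range(2000, 2019)
    for m in (1, 6, 12)
    for d in (1, 15)
    for _ in range(3)
]
sort_fields = ["year", "month", "day"]

# Partitioning on a subset of the sort fields keeps each group on one worker.
print(groups_stay_together(docs, ["year"], sort_fields, workers=6))           # True
print(groups_stay_together(docs, ["year", "month"], sort_fields, workers=6))  # True

# How evenly a single key spreads documents depends on its cardinality.
print(Counter(worker_for(d, ["year"], 6) for d in docs))
```

The Counter output is the part worth watching in practice: with too few distinct years, some workers would receive most of the records, which is the caveat in Joel's comment.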

 

At this point the user will understand how to use partitionKeys. They are related 
to the sort criteria but do not need to include all of the sort fields.

 

We should now mention a trick: the user can warm up the hash queries, since 
they are always run on the whole document set (irrespective of the filter 
criteria).

Also, users should only use parallel when the number of docs matching the 
filter criteria is very large.
{code:java}
<listener event="newSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <!-- One warming query per worker: worker=0 through worker=5 -->
    <lst><str name="q">*:*</str><str name="fq">{!hash workers=6 worker=0}</str><str name="partitionKeys">myPartitionKey</str></lst>
    <lst><str name="q">*:*</str><str name="fq">{!hash workers=6 worker=1}</str><str name="partitionKeys">myPartitionKey</str></lst>
    <lst><str name="q">*:*</str><str name="fq">{!hash workers=6 worker=2}</str><str name="partitionKeys">myPartitionKey</str></lst>
    <lst><str name="q">*:*</str><str name="fq">{!hash workers=6 worker=3}</str><str name="partitionKeys">myPartitionKey</str></lst>
    <lst><str name="q">*:*</str><str name="fq">{!hash workers=6 worker=4}</str><str name="partitionKeys">myPartitionKey</str></lst>
    <lst><str name="q">*:*</str><str name="fq">{!hash workers=6 worker=5}</str><str name="partitionKeys">myPartitionKey</str></lst>
  </arr>
</listener>
{code}
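
Writing one warming entry per worker by hand is easy to get wrong when the worker count changes. As a rough sketch, the Python below generates an equivalent listener for any number of workers using the standard library's xml.etree.ElementTree. The function name warming_listener and its parameters are hypothetical, and myPartitionKey is the same placeholder field used in the listener above.

```python
import xml.etree.ElementTree as ET


def warming_listener(workers, partition_keys, event="newSearcher"):
    """Build a QuerySenderListener element with one hash query per worker."""
    listener = ET.Element(
        "listener", {"event": event, "class": "solr.QuerySenderListener"}
    )
    queries = ET.SubElement(listener, "arr", {"name": "queries"})
    for worker in range(workers):
        params = (
            ("q", "*:*"),
            ("fq", "{!hash workers=%d worker=%d}" % (workers, worker)),
            ("partitionKeys", partition_keys),
        )
        lst = ET.SubElement(queries, "lst")
        for name, value in params:
            ET.SubElement(lst, "str", {"name": name}).text = value
    return listener


listener = warming_listener(workers=6, partition_keys="myPartitionKey")
print(ET.tostring(listener, encoding="unicode"))
```

The printed element is meant to be pasted into solrconfig.xml. Presumably the workers value and partition key should match what the parallel expression actually sends, otherwise the warmed filters would not be the ones reused.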

  

  was:
The aim of this Jira is to beef up the ref guide's coverage of the parallel stream.

There are two things I want to address:

 

First, the usage of partitionKeys:

This line in the ref guide suggests that the partitionKeys of a parallel stream 
should always be the same as the underlying sort criteria:
{code:java}
The parallel function maintains the sort order of the tuples returned by the 
worker nodes, so the sort criteria of the parallel function must match up with 
the sort order of the tuples returned by the workers.
{code}
But in the discussion on SOLR-12635, Joel provided a counterexample:
{code:java}
The hash partitioner just needs to send documents to the same worker node. You 
could do that with just one partitioning key

For example if you sort on year, month and day. You could partition on year 
only and still be fine as long as there was enough different years to spread 
the records around the worker nodes.{code}
So we should make this clearer in the ref guide.

Let's also document that, after SOLR-12683, specifying more than 4 partitionKeys 
will throw an error.

 

At this point the user will understand how to use partitionKeys. They are related 
to the sort criteria but do not need to include all of the sort fields.

 

We should now mention a trick: the user can warm up the hash queries, since 
they are always run on the whole document set (irrespective of the filter 
criteria).

Also, users should only use parallel when the number of docs matching the 
filter criteria is very large.

 
<listener event="newSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <lst><str name="q">*:*</str><str name="fq">\{!hash workers=6 worker=0}</str><str name="partitionKeys">myPartitionKey</str></lst>
    <lst><str name="q">*:*</str><str name="fq">\{!hash workers=6 worker=1}</str><str name="partitionKeys">myPartitionKey</str></lst>
    <lst><str name="q">*:*</str><str name="fq">\{!hash workers=6 worker=2}</str><str name="partitionKeys">myPartitionKey</str></lst>
    <lst><str name="q">*:*</str><str name="fq">\{!hash workers=6 worker=3}</str><str name="partitionKeys">myPartitionKey</str></lst>
    <lst><str name="q">*:*</str><str name="fq">\{!hash workers=6 worker=4}</str><str name="partitionKeys">myPartitionKey</str></lst>
    <lst><str name="q">*:*</str><str name="fq">\{!hash workers=6 worker=5}</str><str name="partitionKeys">myPartitionKey</str></lst>
  </arr>
</listener>
 


> Document speed gotchas and partitionKeys usage for ParallelStream
> -----------------------------------------------------------------
>
>                 Key: SOLR-12684
>                 URL: https://issues.apache.org/jira/browse/SOLR-12684
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>            Reporter: Varun Thacker
>            Assignee: Varun Thacker
>            Priority: Major



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

