[jira] [Updated] (SOLR-12658) Extend support for more than 4 field in 'partitionKeys' in ParallelStream after SOLR-11598

Amrit Sarkar (JIRA) Sun, 12 Aug 2018 08:35:17 -0700


     [ 
https://issues.apache.org/jira/browse/SOLR-12658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Amrit Sarkar updated SOLR-12658:
--------------------------------
    Description: 
SOLR-11598 extended the Export handler to support more than 4 fields for 
sorting.

Streaming expressions rely on the Export handler, but ParallelStream still 
allows a maximum of 4 fields in "{color:blue}partitionKeys{color}" and 
silently ignores any fields beyond the fourth.

 HashQParserPlugin:CompositeHash: 347
{code}
  private static class CompositeHash implements HashKey {

    private HashKey key1;
    private HashKey key2;
    private HashKey key3;
    private HashKey key4;

    public CompositeHash(HashKey[] hashKeys) {
      key1 = hashKeys[0];
      key2 = hashKeys[1];
      key3 = (hashKeys.length > 2) ? hashKeys[2] : new ZeroHash();
      key4 = (hashKeys.length > 3) ? hashKeys[3] : new ZeroHash();
    }

    public void setNextReader(LeafReaderContext context) throws IOException {
      key1.setNextReader(context);
      key2.setNextReader(context);
      key3.setNextReader(context);
      key4.setNextReader(context);
    }

    public long hashCode(int doc) throws IOException {
      return key1.hashCode(doc) + key2.hashCode(doc) + key3.hashCode(doc) + key4.hashCode(doc);
    }
  }
{code}

To make sure documents are distributed evenly across workers when a 
streaming expression is executed in parallel, all the fields specified in 
'partitionKeys' should be considered when calculating which worker a 
particular document goes to for further processing.
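
One possible shape for this, as a minimal sketch only and not the attached patch: keep the additive combination used by CompositeHash above, but loop over every key instead of four fixed fields. SimpleHashKey, VariableCompositeHash and CompositeHashSketch are hypothetical names; the interface is a simplified stand-in for HashKey that drops setNextReader(LeafReaderContext) and the IOException so the example compiles without Lucene.

```java
// Hypothetical, simplified stand-in for Solr's HashKey. The real interface
// also declares setNextReader(LeafReaderContext) and throws IOException.
interface SimpleHashKey {
  long hashCode(int doc);
}

// Sketch of a composite key that folds in every supplied key, not just four.
final class VariableCompositeHash implements SimpleHashKey {
  private final SimpleHashKey[] keys;

  VariableCompositeHash(SimpleHashKey[] hashKeys) {
    if (hashKeys == null || hashKeys.length == 0) {
      throw new IllegalArgumentException("at least one hash key is required");
    }
    this.keys = hashKeys.clone();
  }

  @Override
  public long hashCode(int doc) {
    long sum = 0L;
    for (SimpleHashKey key : keys) {
      sum += key.hashCode(doc); // same additive combination as CompositeHash
    }
    return sum;
  }
}

final class CompositeHashSketch {
  public static void main(String[] args) {
    // Six toy per-field hashes; real keys would read doc values per field.
    SimpleHashKey[] six = new SimpleHashKey[6];
    for (int i = 0; i < six.length; i++) {
      final long weight = i + 1;
      six[i] = doc -> weight * (doc + 1);
    }
    SimpleHashKey composite = new VariableCompositeHash(six);
    System.out.println("composite hash for doc 0: " + composite.hashCode(0));
  }
}
```

With an array-backed composite, a fifth or sixth partition key changes the resulting hash instead of being dropped.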

A use case where this flexibility would be beneficial:

{code}
parallel(workerCollection,
         search(collection1, q=*:*,
                fl="id, org, dept, year, month, date, hour",
                sort="org desc, dept desc, year desc, month desc, date desc, hour desc",
                partitionKeys="org, dept, year, month"),
         workers="6",
         zkHost="localhost:9983",
         sort="year desc")
{code}

In this case, we are partitioning on "org, dept, year, month". 
Now look at the data:
org dept year month date hour
{code}
org1 dept1 1991 jan 24 11
org1 dept1 1991 jan 24 12
org1 dept1 1991 jan 24 13
....................
....................
org2 dept1 1991 jan 24 11
{code}

For the data to be distributed equally across the stated 6 workers, at least 
6 distinct subsets need to be created in the first place. 
As we can see in the data, the specified partition keys yield only two unique 
sets, {"org1 dept1 1991 jan", "org2 dept1 1991 jan"}, so only 2 of the 6 
workers will be used. 
Also, the data contains many more documents for "org1" than for "org2", so 
one worker does much more work than the other; a finer partitioning of the 
data could have optimised the processing of documents.
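
The effect of key cardinality on worker usage can be illustrated with a small sketch. It assumes a worker is chosen as the hash of the partition key values modulo the worker count, and it uses Arrays.hashCode as a stand-in hash; Solr's actual hashing differs, and PartitionSpreadSketch, workerFor and workersUsed are hypothetical names.

```java
import java.util.Arrays;
import java.util.Set;
import java.util.TreeSet;

final class PartitionSpreadSketch {
  // Assumption: worker = hash(first keyCount column values) mod workers.
  static int workerFor(String[] row, int keyCount, int workers) {
    int hash = Arrays.hashCode(Arrays.copyOf(row, keyCount));
    return Math.floorMod(hash, workers);
  }

  // Collects the distinct workers that receive at least one row.
  static Set<Integer> workersUsed(String[][] rows, int keyCount, int workers) {
    Set<Integer> used = new TreeSet<>();
    for (String[] row : rows) {
      used.add(workerFor(row, keyCount, workers));
    }
    return used;
  }

  public static void main(String[] args) {
    // Columns: org, dept, year, month, date, hour (rows from the sample data).
    String[][] rows = {
      {"org1", "dept1", "1991", "jan", "24", "11"},
      {"org1", "dept1", "1991", "jan", "24", "12"},
      {"org1", "dept1", "1991", "jan", "24", "13"},
      {"org2", "dept1", "1991", "jan", "24", "11"},
    };
    // With 4 keys there are only two distinct key sets, so at most 2 workers.
    System.out.println("4 keys: " + workersUsed(rows, 4, 6));
    // With all 6 columns every row is distinct, so more workers can be reached.
    System.out.println("6 keys: " + workersUsed(rows, 6, 6));
  }
}
```

The number of workers that receive documents can never exceed the number of distinct partition-key combinations, which is why allowing more fields in 'partitionKeys' can improve the spread.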






  was:
SOLR-11598 extended the Export handler to support more than 4 fields for 
sorting.

Streaming expressions rely on the Export handler, but ParallelStream still 
allows a maximum of 4 fields in "{color:blue}partitionKeys{color}" and 
silently ignores any fields beyond the fourth.

 HashQParserPlugin:CompositeHash: 347
{code}
  private static class CompositeHash implements HashKey {

    private HashKey key1;
    private HashKey key2;
    private HashKey key3;
    private HashKey key4;

    public CompositeHash(HashKey[] hashKeys) {
      key1 = hashKeys[0];
      key2 = hashKeys[1];
      key3 = (hashKeys.length > 2) ? hashKeys[2] : new ZeroHash();
      key4 = (hashKeys.length > 3) ? hashKeys[3] : new ZeroHash();
    }

    public void setNextReader(LeafReaderContext context) throws IOException {
      key1.setNextReader(context);
      key2.setNextReader(context);
      key3.setNextReader(context);
      key4.setNextReader(context);
    }

    public long hashCode(int doc) throws IOException {
      return key1.hashCode(doc) + key2.hashCode(doc) + key3.hashCode(doc) + key4.hashCode(doc);
    }
  }
{code}

To make sure documents are distributed evenly across workers when a 
streaming expression is executed in parallel, all the fields specified in 
'partitionKeys' should be considered when calculating which worker a 
particular document goes to for further processing.




> Extend support for more than 4 field in 'partitionKeys' in ParallelStream 
> after SOLR-11598
> ------------------------------------------------------------------------------------------
>
>                 Key: SOLR-12658
>                 URL: https://issues.apache.org/jira/browse/SOLR-12658
>             Project: Solr
>          Issue Type: Improvement
>      Security Level: Public(Default Security Level. Issues are Public) 
>          Components: streaming expressions
>            Reporter: Amrit Sarkar
>            Priority: Minor
>         Attachments: SOLR-12658.patch
>
>



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
