I added some comments on PHOENIX-4757.

On Thursday, September 13, 2018, 6:42:12 PM PDT, Josh Elser <els...@apache.org> wrote:

Ahh, I get you now.
For a composite primary key made up of columns 1 through N, you want
similar controls to compute the value of the salt based on a sequence of
the columns 1 through M, where M <= N (instead of always on all columns).
For large numbers of salt buckets and a scan over a facet, you prune your
search space considerably. Makes sense to me!

On 9/13/18 6:37 PM, Gerald Sangudi wrote:
> In case the text formatting is lost below, I also added it as a comment in
> the JIRA ticket:
>
> https://issues.apache.org/jira/browse/PHOENIX-4757
>
> On Thu, Sep 13, 2018 at 3:24 PM, Gerald Sangudi <gsang...@23andme.com>
> wrote:
>
>> Sorry I missed Josh's reply; I've subscribed to the dev list now.
>>
>> Below is a copy-and-paste from our internal document. Thanks in advance
>> for your review and additional feedback on this.
>>
>> Gerald
>>
>> Background
>>
>> We make extensive use of multi-column rowkeys and salting
>> <https://phoenix.apache.org/salted.html> in our different Apache Phoenix
>> deployments. We frequently perform group-by aggregations on these data
>> along a specific dimension that would benefit from predictably
>> partitioning the data along that dimension.
>>
>> Proposal
>>
>> We propose to add table metadata to allow schema designers to constrain
>> salting to a subset of the rowkey, rather than the full rowkey as it is
>> today. This will introduce a mechanism to partition data on a per-table
>> basis along a single dimension, without application changes or much
>> change to the Phoenix runtime logic. We expect this will result in
>> substantially faster group-bys along the salted dimension and negligible
>> penalties elsewhere.
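The column-constrained salting described above can be sketched in a few
lines of Java. This is a hedged toy sketch: the hash function and the
string key encoding are stand-ins (Phoenix's real logic lives in
SaltingUtil.getSaltingByte and its rowkey serialization); only the shape
of the computation matters here.

```java
import java.nio.charset.StandardCharsets;

public class PrefixSaltSketch {
    // Toy stand-in for SaltingUtil.getSaltingByte(byte[], int, int, int):
    // hash a byte range, then take it mod the bucket count. Phoenix's real
    // hash differs; only the shape of the computation is illustrated.
    static byte saltByte(byte[] key, int offset, int length, int bucketNum) {
        int hash = 0;
        for (int i = offset; i < offset + length; i++) {
            hash = 31 * hash + key[i];
        }
        return (byte) Math.floorMod(hash, bucketNum);
    }

    // Today's behavior: salt over the whole composite rowkey (columns 1..N).
    static byte saltFullKey(byte[] rowkey, int bucketNum) {
        return saltByte(rowkey, 0, rowkey.length, bucketNum);
    }

    // The proposed control: salt over a prefix of the rowkey (columns 1..M).
    static byte saltPrefix(byte[] rowkey, int prefixLength, int bucketNum) {
        return saltByte(rowkey, 0, prefixLength, bucketNum);
    }

    public static void main(String[] args) {
        int buckets = 60;
        // Two rows sharing the leading column value but differing afterwards.
        byte[] rowA = "facet1|trailing-a".getBytes(StandardCharsets.UTF_8);
        byte[] rowB = "facet1|trailing-b".getBytes(StandardCharsets.UTF_8);
        int prefixLen = "facet1".length();

        // Prefix salting puts both rows in the same bucket, so a scan over
        // the facet touches one bucket instead of all 60.
        System.out.println(
            saltPrefix(rowA, prefixLen, buckets) == saltPrefix(rowB, prefixLen, buckets)); // true
    }
}
```

With full-key salting the two rows would hash over their differing
suffixes and could land anywhere, which is exactly the search-space
pruning trade-off described above.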
>> This feature has also been proposed in PHOENIX-4757
>> <https://issues.apache.org/jira/browse/PHOENIX-4757>, where it was
>> pointed out that partitioning and sorting data along different
>> dimensions is a common pattern in other datastores as well.
>> Theoretically, it could cause hotspotting when querying along the salted
>> dimension without the leading rowkey; that would be an anti-pattern.
>>
>> Usage Example
>>
>> Current:
>>
>> Schema:
>>
>> CREATE TABLE relationship (
>>   id_1 BIGINT NOT NULL,
>>   id_2 BIGINT NOT NULL,
>>   other_key BIGINT NOT NULL,
>>   val SMALLINT,
>>   CONSTRAINT pk PRIMARY KEY (id_1, id_2, other_key)
>> ) SALT_BUCKETS=60;
>>
>> Query:
>>
>> SELECT id_2, SUM(val)
>> FROM relationship
>> WHERE id_1 IN (2, 3)
>> GROUP BY id_2;
>>
>> Explain:
>>
>> 0: jdbc:phoenix:> EXPLAIN SELECT id_2, SUM(val) FROM relationship WHERE id_1 IN (2,3) GROUP BY id_2;
>> +-----------------------------------------------------------------------------------------+--------+
>> | PLAN                                                                                    | EST_BY |
>> +-----------------------------------------------------------------------------------------+--------+
>> | CLIENT 60-CHUNK PARALLEL 60-WAY SKIP SCAN ON 120 KEYS OVER RELATIONSHIP [0,2] - [59,3]  | null   |
>> | SERVER AGGREGATE INTO DISTINCT ROWS BY [ID_2]                                           | null   |
>> | CLIENT MERGE SORT                                                                       | null   |
>> +-----------------------------------------------------------------------------------------+--------+
>> 3 rows selected (0.048 seconds)
>>
>> In this case, although the group by is performed on both the client and
>> the regionservers, almost all of the actual grouping happens on the
>> client, because the id_2 values are randomly distributed across the
>> regionservers. As a result, a lot of unnecessary data is serialized to
>> the client and grouped serially there.
>> This can become quite material with large resultsets.
>>
>> Proposed:
>>
>> Schema:
>>
>> CREATE TABLE relationship (
>>   id_1 BIGINT NOT NULL,
>>   id_2 BIGINT NOT NULL,
>>   other_key BIGINT NOT NULL,
>>   val SMALLINT,
>>   CONSTRAINT pk PRIMARY KEY (id_1, id_2, other_key)
>> ) SALT_BUCKETS=60, SALT_COLUMN=id_2;
>>
>> Query (unchanged):
>>
>> SELECT id_2, SUM(val)
>> FROM relationship
>> WHERE id_1 IN (2, 3)
>> GROUP BY id_2;
>>
>> Explain (unchanged).
>>
>> Under the proposal, the data are partitioned so that all rows containing
>> the same id_2 are on the same regionserver, so the above query will
>> perform almost all of the grouping in parallel on the regionservers. No
>> special hint or changes to the query plan would be required to benefit.
>> Existing tables would need to be re-salted to take advantage of the new
>> functionality.
>>
>> Technical changes proposed to Phoenix:
>>
>> - Create a new piece of table-level metadata: SALT_COLUMN. SALT_COLUMN
>>   will instruct the salting logic to generate the salt byte based only
>>   on the specified column. If unspecified, it will behave as it does
>>   today and default to salting the entire rowkey. This metadata may be
>>   specified only when the table is created and may not be modified. The
>>   specified column must be part of the rowkey.
>> - Modify all callers of getSaltingByte
>>   <https://github.com/apache/phoenix/blob/master/phoenix-core/src/main/java/org/apache/phoenix/schema/SaltingUtil.java#L77>
>>   (byte[] value, int offset, int length, int bucketNum) to consistently
>>   leverage the new metadata.
>> - Tests
>> - Docs
>>
>> Design points:
>>
>> One salt column vs. multiple salt columns: Based on the existing
>> signature for getSaltingByte, it seems simpler to support only a single
>> SALT_COLUMN rather than multiple arbitrary SALT_COLUMNS. Known use-cases
>> are completely supported by a single column.
>>
>> Syntax: PHOENIX-4757
>> <https://issues.apache.org/jira/browse/PHOENIX-4757> suggests an
>> alternate, less verbose syntax for defining the salt bucket.
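To make the getSaltingByte change above concrete, here is a hedged sketch
of what salting on SALT_COLUMN = id_2 amounts to. The BIGINT encoding and
the hash below are illustrative stand-ins, not Phoenix's actual PDataType
serialization or SaltingUtil logic; the point is only that the salt byte
is a function of id_2's bytes alone.

```java
import java.nio.ByteBuffer;

public class SaltColumnSketch {
    // Toy hash standing in for SaltingUtil.getSaltingByte's internals.
    static byte saltByte(byte[] value, int offset, int length, int bucketNum) {
        int hash = 0;
        for (int i = offset; i < offset + length; i++) {
            hash = 31 * hash + value[i];
        }
        return (byte) Math.floorMod(hash, bucketNum);
    }

    // With SALT_COLUMN = id_2, the salt is computed from id_2's serialized
    // bytes alone; id_1 and other_key no longer influence bucket placement.
    static byte saltForRow(long id1, long id2, long otherKey, int bucketNum) {
        byte[] id2Bytes = ByteBuffer.allocate(8).putLong(id2).array(); // toy encoding
        return saltByte(id2Bytes, 0, id2Bytes.length, bucketNum);
    }

    public static void main(String[] args) {
        int buckets = 60;
        // Rows with the same id_2 land in the same bucket even though id_1
        // and other_key differ, so "GROUP BY id_2" can be completed on the
        // regionserver that owns that bucket.
        byte a = saltForRow(2L, 77L, 10L, buckets);
        byte b = saltForRow(3L, 77L, 99L, buckets);
        System.out.println(a == b); // true: same id_2 -> same salt bucket
    }
}
```

This is also why the metadata must be immutable after CREATE TABLE: changing
the salt input would silently reshuffle which bucket every existing row
belongs to.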
>> The SALT_COLUMN syntax is suggested for clarity and consistency with
>> other Phoenix table options.
>>
>> Future Enhancements (not in scope)
>>
>> Different aspects of the query execution runtime could take advantage of
>> the new metadata and the implied knowledge that the data are partitioned
>> in a predictable manner. For example:
>>
>> - It could be that client-side grouping is completely unnecessary in
>>   cases where the SALT_COLUMN is part of the group-by expression.
>> - A query that contains a literal equality predicate for the SALT_COLUMN
>>   can be isolated to a single regionserver, rather than broadcast to all
>>   regionservers.
>> - A client-side merge-sort join based on the SALT_COLUMN could optimize
>>   the organization of merges.
>> - Similarly, a server-side hash join could distribute only the
>>   'necessary' portions of the hash table to each regionserver.
>>
>> If additional advantages of these types come for free, that's great, but
>> they can be follow-on enhancements to the initial commit.
>>
>> On Thu, Sep 13, 2018 at 9:33 AM, Thomas D'Silva <tdsi...@salesforce.com>
>> wrote:
>>
>>> Gerald,
>>>
>>> I think you missed Josh's reply here:
>>> https://lists.apache.org/thread.html/c5145461805429622a410c23c1199d578e146a5c94511b2d5833438b@%3Cdev.phoenix.apache.org%3E
>>>
>>> Could you explain how using a subset of the pk columns to generate the
>>> salt byte helps with partitioning, aggregations, etc.?
>>>
>>> Thanks,
>>> Thomas
>>>
>>> On Thu, Sep 13, 2018 at 8:32 AM, Gerald Sangudi <gsang...@23andme.com>
>>> wrote:
>>>
>>>> Hi folks,
>>>>
>>>> Any thoughts or feedback on this?
>>>>
>>>> Thanks,
>>>> Gerald
>>>>
>>>> On Mon, Sep 10, 2018 at 1:56 PM, Gerald Sangudi <gsang...@23andme.com>
>>>> wrote:
>>>>
>>>>> Hello folks,
>>>>>
>>>>> We have a requirement for salting based on partial, rather than full,
>>>>> rowkeys. My colleague Mike Polcari has identified the requirement and
>>>>> proposed an approach.
>>>>> I found an already-open JIRA ticket for the same issue:
>>>>> https://issues.apache.org/jira/browse/PHOENIX-4757. I can provide
>>>>> more details from the proposal.
>>>>>
>>>>> The JIRA proposes a syntax of SALT_BUCKETS(col, ...) = N, whereas
>>>>> Mike proposes SALT_COLUMN=col or SALT_COLUMNS=col, ...
>>>>>
>>>>> The benefit at issue is that users gain more control over
>>>>> partitioning, and this can be used to push some additional
>>>>> aggregations and hash joins down to the regionservers.
>>>>>
>>>>> I would appreciate any go-ahead / thoughts / guidance / objections /
>>>>> feedback. I'd like to be sure that the concept at least is not
>>>>> objectionable. We would like to work on this and submit a patch down
>>>>> the road. I'll also add a note to the JIRA ticket.
>>>>>
>>>>> Thanks,
>>>>> Gerald