Re: [PR] HIVE-27827: Improve performance of direct SQL implement for getPartitionsByFilter [hive]

via GitHub Tue, 19 Mar 2024 02:47:00 -0700


cxzl25 commented on code in PR #4831:
URL: https://github.com/apache/hive/pull/4831#discussion_r1530023612



##########
standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/MetaStoreDirectSql.java:
##########
@@ -1520,10 +1523,43 @@ public void visit(LeafNode node) throws MetaException {
       if (node.operator == Operator.LIKE) {
         nodeValue0 = nodeValue0 + " ESCAPE '\\' ";
       }
+      String filter = node.isReverseOrder
+              ? nodeValue0 + " " + node.operator.getSqlOp() + " " + tableValue
+              : tableValue + " " + node.operator.getSqlOp() + " " + nodeValue0;
+      // For equals and not-equals filter, we can add partition name filter to 
improve performance.
+      boolean isOpEquals = Operator.isEqualOperator(node.operator);
+      if (isOpEquals || Operator.isNotEqualOperator(node.operator)) {
+        Map<String, String> partKeyToVal = new HashMap<>();
+        partKeyToVal.put(partCol.getName(), nodeValue.toString());
+        String escapedNameFragment = Warehouse.makePartName(partKeyToVal, 
false);
+        if (colType == FilterType.Date) {
+          // Some engines like Pig will record both date and time values, in 
which case we need
+          // match PART_NAME by like clause.
+          escapedNameFragment += "%";
+        }
+        if (colType != FilterType.Date && partColCount == 1) {
+          // Case where partition column type is not date and there is no 
other partition columns
+          params.add(escapedNameFragment);
+          filter += " and " + PARTITIONS + ".\"PART_NAME\"" + (isOpEquals ? " 
=? " : " !=? ");
+        } else {
+          if (partColCount == 1) {
+            // Case where partition column type is date and there is no other 
partition columns
+            params.add(escapedNameFragment);
+          } else if (partColIndex + 1 == partColCount) {
+            // Case where the partition column is at the end of the name.
+            params.add("%/" + escapedNameFragment);
+          } else if (partColIndex == 0) {
+            // Case where the partition column is at the beginning of the name.
+            params.add(escapedNameFragment + "/%");

Review Comment:
   Thank you for your contribution. We also made similar optimizations last 
year, which greatly reduced the load of MySQL.
   
   However, I suggest `params.add(escapedNameFragment + "/%");` This can be 
transformed into a multi-level partition filter, which can filter more data 
when scanning.
   
   Now it looks like this
   ```
   d=xxx/%
   ```
   
   <img width="468" alt="image" 
src="https://github.com/apache/hive/assets/3898450/d8339b08-bc69-42dc-ad09-7887731c0f46";>
   
   
   Advanced optimization
   ```
   d=xxx/h=_%/b=_%
   ```
   
   <img width="641" alt="image" 
src="https://github.com/apache/hive/assets/3898450/4bacb055-136d-464c-8a80-547636ec7e0b";>
   
   ---
   
   If you think this is a good idea, I will create a PR.  @wecharyu 
@saihemanth-cloudera 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: gitbox-unsubscr...@hive.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscr...@hive.apache.org
For additional commands, e-mail: gitbox-h...@hive.apache.org

Re: [PR] HIVE-27827: Improve performance of direct SQL implement for getPartitionsByFilter [hive]

Reply via email to