[GitHub] szlta commented on a change in pull request #538: HIVE-21217: Optimize range calculation for PTF

GitBox Tue, 19 Feb 2019 07:30:25 -0800

szlta commented on a change in pull request #538: HIVE-21217: Optimize range 
calculation for PTF
URL: https://github.com/apache/hive/pull/538#discussion_r258090284


 ##########
 File path: ql/src/java/org/apache/hadoop/hive/ql/udf/ptf/ValueBoundaryScanner.java
 ##########
 @@ -44,10 +49,207 @@ public ValueBoundaryScanner(BoundaryDef start, BoundaryDef end, boolean nullsLas
     this.nullsLast = nullsLast;
   }
 
+  public abstract Object computeValue(Object row) throws HiveException;
+
+  /**
+   * Checks if the distance of v2 to v1 is greater than the given amt.
+   * @return True if the value of v1 - v2 is greater than amt or either value is null.
+   */
+  public abstract boolean isDistanceGreater(Object v1, Object v2, int amt);
+
+  /**
+   * Checks if the values of v1 or v2 are the same.
+   * @return True if both values are the same or both are nulls.
+   */
+  public abstract boolean isEqual(Object v1, Object v2);
+
   public abstract int computeStart(int rowIdx, PTFPartition p) throws HiveException;
 
   public abstract int computeEnd(int rowIdx, PTFPartition p) throws HiveException;
 
+  /**
+   * Checks and maintains cache content - optimizes cache window to always be around current row
+   * thereby makes it follow the current progress.
+   * @param rowIdx current row
+   * @param p current partition for the PTF operator
+   * @throws HiveException
+   */
+  public void handleCache(int rowIdx, PTFPartition p) throws HiveException {
+    BoundaryCache cache = p.getBoundaryCache();
+    if (cache == null) {
+      return;
+    }
+
+    //Start of partition
+    if (rowIdx == 0) {
+      cache.clear();
+    }
+    if (cache.isComplete()) {
+      return;
+    }
+
+    int cachePos = cache.approxCachePositionOf(rowIdx);
+
+    if (cache.isEmpty()) {
+      fillCacheUntilEndOrFull(rowIdx, p);
+    } else if (cachePos > 50 && cachePos <= 75) {
 
 Review comment:
   We don't know the window sizes beforehand. The numbers the user gives in the 
window definition are matched against values of the orderby col, so e.g. 
preceding 2, following 2 might cover only a few rows, but can also cover 
thousands if many rows share the same value or the orderby col is of double type.
   That said, I've updated the code that moves the cache window so that it 
minimizes the number of cache misses, and thereby reads (as discussed offline).
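
   To illustrate why the row count of a value-based window cannot be known up 
front, here is a minimal sketch. It is not the code from the PR: the class 
RangeWindowSketch, the double[] input, and the simple linear scans are 
assumptions made for illustration only. Only the "v1 - v2 is greater than amt" 
check mirrors the isDistanceGreater Javadoc quoted above.

```java
public class RangeWindowSketch {

    // Hypothetical stand-in for isDistanceGreater: true if v1 - v2 > amt.
    static boolean isDistanceGreater(double v1, double v2, int amt) {
        return v1 - v2 > amt;
    }

    // Index (inclusive) of the first row whose value is at most amt below the current row's value.
    static int computeStart(double[] sorted, int rowIdx, int amt) {
        int start = rowIdx;
        while (start > 0 && !isDistanceGreater(sorted[rowIdx], sorted[start - 1], amt)) {
            start--;
        }
        return start;
    }

    // Index (exclusive) one past the last row whose value is at most amt above the current row's value.
    static int computeEnd(double[] sorted, int rowIdx, int amt) {
        int end = rowIdx + 1;
        while (end < sorted.length && !isDistanceGreater(sorted[end], sorted[rowIdx], amt)) {
            end++;
        }
        return end;
    }

    // Number of rows in the window "preceding amt, following amt" around rowIdx.
    static int windowSize(double[] sorted, int rowIdx, int amt) {
        return computeEnd(sorted, rowIdx, amt) - computeStart(sorted, rowIdx, amt);
    }

    public static void main(String[] args) {
        // Widely spaced values: the window holds only the current row.
        double[] sparse = {10, 20, 30, 40, 50};
        System.out.println(windowSize(sparse, 2, 2)); // prints 1

        // Densely packed double values: the same window definition covers every row.
        double[] dense = new double[1000];
        for (int i = 0; i < dense.length; i++) {
            dense[i] = 30 + i * 0.001;
        }
        System.out.println(windowSize(dense, 500, 2)); // prints 1000
    }
}
```

   The same "preceding 2, following 2" definition yields a window of 1 row in 
the first case and 1000 rows in the second, which is why a fixed cache size 
cannot be derived from the window definition alone.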

