mcvsubbu commented on a change in pull request #3979: Track Indexed timestamp 
across consuming segments
URL: https://github.com/apache/incubator-pinot/pull/3979#discussion_r266514769
 
 

 ##########
 File path: 
pinot-core/src/main/java/org/apache/pinot/core/indexsegment/mutable/MutableSegmentImpl.java
 ##########
 @@ -204,20 +205,22 @@ public boolean index(GenericRow row) {
     // If metrics aggregation is enabled and if the dimension values were 
already seen, this will return existing docId,
     // else this will return a new docId.
     int docId = getOrCreateDocId(dictIdMap);
-
+    boolean canTakeMore = false;
     // docId == numDocs implies new docId.
     if (docId == numDocs) {
       // Add forward and inverted indices for new document.
       addForwardIndex(row, docId, dictIdMap);
       addInvertedIndex(docId, dictIdMap);
       // Update number of document indexed at last to make the latest record 
queryable
-      return _numDocsIndexed++ < _capacity;
+      canTakeMore = _numDocsIndexed++ < _capacity;
     } else {
-      Preconditions
-          .checkState(_aggregateMetrics, "Invalid document-id during indexing: 
" + docId + " expected: " + numDocs);
+      Preconditions.checkState(_aggregateMetrics, "Invalid document-id during 
indexing: " + docId + " expected: " + numDocs);
       // Update metrics for existing document.
-      return aggregateMetrics(row, docId);
+      canTakeMore = aggregateMetrics(row, docId);
     }
+    // update indexing time
+    _lastIndexedTimestamp = System.currentTimeMillis();
 
 Review comment:
   If there is a delay in the stream, that will show up as a pinot delay. 
Another choice is to take the last _polled_ timestamp. That will ensure that as 
long as Pinot polls the stream, the available data is considered fresh. Not 
sure why we are not doing that.  A stream might (and probably should) have its 
own staleness metric being tracked and monitored so that an application does 
not need to track a staleness metric via pinot and root cause it to the stream 
if that is indeed the case.
   
   If you want to continue with this approach, then I suggest that you extend 
the interface to retrieve rows from a stream to include the highest available 
timestamp of a record _in_ the stream. That way, streams that provide the 
staleness metric can translate (via Pinot) to the staleness monitoring 
application.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@pinot.apache.org
For additional commands, e-mail: commits-h...@pinot.apache.org

Reply via email to