asautins commented on PR #1669:
URL: https://github.com/apache/samza/pull/1669#issuecomment-1571271684

   I agree that the profile doesn't make sense.  We've profiled multiple times, 
and every profile shows `getDirDiff` higher than one would expect if it ran 
only once a minute over a few files.  A few things that come to mind that may 
contribute:
   
      * Beam - The job is a Beam job, not just a low-level or high-level Samza 
job.  I wouldn't think that would matter.
      * 200k/sec - The job processes ~200k records/second from 3 topics.  While 
that's not a lot, it's more than a little.
      * Join - The job joins a stream following the model in the [Beam 
Programming Guide section 11.5.1. Joining clicks and 
views](https://beam.apache.org/documentation/programming-guide/#joining-clicks-and-views).
 
      * GC using timers - There is also a timer used for garbage collection 
following the pattern in the [Beam Programming Guide section 11.4 garbage 
collecting 
state](https://beam.apache.org/documentation/programming-guide/#garbage-collecting-state).
      * ~15 state IDs/~5 event timers - So more than a few but fewer than a lot.
   
   Will update the ticket if we understand why we currently see `getDirDiff` so 
high in our profiles.  
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
