[ https://issues.apache.org/jira/browse/SAMZA-2783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17726758#comment-17726758 ]
Andy Sautins commented on SAMZA-2783:
-------------------------------------

PR: https://github.com/apache/samza/pull/1669

> Memoize DirDiffUtil to avoid repeated calls to areSameFile
> ----------------------------------------------------------
>
>                 Key: SAMZA-2783
>                 URL: https://issues.apache.org/jira/browse/SAMZA-2783
>             Project: Samza
>          Issue Type: Improvement
>    Affects Versions: 1.4
>            Reporter: Andy Sautins
>            Priority: Minor
>
> Profiling a Samza job showed that, for this particular job, ~38% of the
> time was spent in
> org.apache.samza.storage.blobstore.util.DirDiffUtil.getDirDiff, with
> areSameFile as the primary contributor.
>
> The code has the following comment at DirDiffUtil.java:271:
> {code:java}
> // TODO MED shesharm: this compares each file in directory 3 times. Categorize files in one traversal instead.{code}
>
> While restructuring the code is an option, a quick win would be to memoize
> the results from areSameFile. Restructuring the code could potentially
> result in a lower memory footprint (memoized results are kept in memory).
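
For illustration, the memoization idea described in the issue could look roughly like the sketch below. This is not the code from the linked PR. The class name MemoizedBiPredicate and the memoize helper are hypothetical, and the sketch assumes the comparison can be expressed as a java.util.function.BiPredicate whose arguments have stable equals/hashCode and whose result does not change while the cache is alive.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.BiPredicate;

// Hypothetical helper, not part of Samza: caches the result of a two-argument predicate.
public class MemoizedBiPredicate {

    // Returns a predicate that evaluates `delegate` once per distinct (left, right) pair
    // and answers repeated calls for the same pair from an in-memory cache.
    public static <T, U> BiPredicate<T, U> memoize(BiPredicate<T, U> delegate) {
        Map<Map.Entry<T, U>, Boolean> cache = new ConcurrentHashMap<>();
        return (left, right) -> cache.computeIfAbsent(
                new SimpleImmutableEntry<>(left, right),
                key -> delegate.test(key.getKey(), key.getValue()));
    }

    public static void main(String[] args) {
        AtomicInteger calls = new AtomicInteger();

        // Stand-in for an expensive comparison such as areSameFile (assumption: any
        // deterministic two-argument check works the same way).
        BiPredicate<String, String> expensive = (a, b) -> {
            calls.incrementAndGet();
            return a.equalsIgnoreCase(b);
        };

        BiPredicate<String, String> cached = memoize(expensive);
        cached.test("a.sst", "A.SST"); // evaluated by the delegate
        cached.test("a.sst", "A.SST"); // served from the cache
        cached.test("a.sst", "b.sst"); // new pair, evaluated by the delegate

        System.out.println(calls.get()); // prints 2
    }
}
```

The cache holds one entry per distinct pair for as long as the wrapped predicate is reachable, which is the memory trade-off mentioned above. Scoping the wrapper to a single directory diff would bound that cost.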