ali-ghanbari commented on a change in pull request #233: URL: https://github.com/apache/commons-text/pull/233#discussion_r675728915
########## File path: src/main/java/org/apache/commons/text/similarity/LongestCommonSubsequence.java ########## @@ -120,24 +170,72 @@ public CharSequence longestCommonSubsequence(final CharSequence left, final Char if (left == null || right == null) { throw new IllegalArgumentException("Inputs must not be null"); } - final StringBuilder longestCommonSubstringArray = new StringBuilder(Math.max(left.length(), right.length())); - final int[][] lcsLengthArray = longestCommonSubstringLengthArray(left, right); - int i = left.length() - 1; - int j = right.length() - 1; - int k = lcsLengthArray[left.length()][right.length()] - 1; - while (k >= 0) { - if (left.charAt(i) == right.charAt(j)) { - longestCommonSubstringArray.append(left.charAt(i)); - i = i - 1; - j = j - 1; - k = k - 1; - } else if (lcsLengthArray[i + 1][j] < lcsLengthArray[i][j + 1]) { - i = i - 1; - } else { - j = j - 1; - } + // Find lengths of two strings + final int leftSz = left.length(); + final int rightSz = right.length(); + + // Check if we can save even more space + if (leftSz < rightSz) { + return algorithmC(right, rightSz, left, leftSz); } - return longestCommonSubstringArray.reverse().toString(); + return algorithmC(left, leftSz, right, rightSz); + } + + /** + * An implementation of "ALG C" from Hirschberg's paper <a href="https://dl.acm.org/doi/10.1145/360825.360861">A linear space algorithm for computing maximal common subsequences</a>. + * Assuming the sequence <code>left</code> is of size <code>m</code> and the sequence <code>right</code> is of size <code>n</code>, + * this method returns Longest Common Subsequence (LCS) the two sequences. + * As per the paper, this method runs in O(m * n) time and O(m + n) space. + * + * @param left Left sequence + * @param m Length of left sequence + * @param right Right sequence + * @param n Length of right sequence + * @return LCS of <code>left</code> and <code>right</code> + */ + static CharSequence algorithmC(final CharSequence left, final int m, Review comment: Hello! > We are almost there. Just wanted to check a few more things before digging into algorithm C and finish reading the paper and code here. I just wanted to let you know that I have made some refactoring. Instead of just passing string sequences to `algorithmB` and `algorithmC`, I have modified the methods to take the strings with their start and end index. I also added an extra parameter to `algorithmB` so that we can traverse the string backward instead of literally reversing the input strings (in previous version of `algorithmC` we were reversing the arguments in one of the calls to `algorithmB`). This refactoring should certainly increase the _dexterity_ of the implementation as we avoid several object creation, string reversal, and virtual method calls (a JMH might be needed anyway, as other maintainers also mentioned). Besides that, I think this refactor was, in fact, necessary to make the implementation as similar to the original algorithm as possible: the proof for space analysis of Algorithm C says "_We assume [...] and substrings can be transferred as arguments by giving initial and final locations_." I hope with this change, I have not interrupted your review process! Please let me know if adding more comments helps. > I think I prefer to keep a single algorithm for the sake of maintainability. And since C performs better, I'd keep just C. > What do you think? I think we cannot get rid of Algorithm B, as it is being used as an auxiliary procedure in implementing Algorithm C: please see step 3 of Algorithm C in the paper. But, I understand the maintainability concern. How about doing this: we could keep **only** Algorithm B in this version to improve `apply` method only; we can keep the old implementation (dynamic programming) of the method `longestCommonSubsequence` for now. We can mention that improving space complexity of `longestCommonSubsequence` is a work in progress. But if we want to keep Algorithm C, we must keep Algorithm B as it is being called from Algorithm C. And in that case (in my opinion) it is ideal to use Algorithm B for `apply` method, because we already have the implementation and it involves less computation than Algorithm C. Please let me know if you prefer the former approach. In that case I can revert the code for `longestCommonSubsequence`. Thanks a lot! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@commons.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org