[GitHub] [commons-text] ali-ghanbari commented on a change in pull request #233: A More Efficient Implementation for Calculating Size of Longest Common Subsequence

GitBox Fri, 23 Jul 2021 10:44:59 -0700


ali-ghanbari commented on a change in pull request #233:
URL: https://github.com/apache/commons-text/pull/233#discussion_r675728915




##########
File path: 
src/main/java/org/apache/commons/text/similarity/LongestCommonSubsequence.java
##########
@@ -120,24 +170,72 @@ public CharSequence longestCommonSubsequence(final 
CharSequence left, final Char
        if (left == null || right == null) {
            throw new IllegalArgumentException("Inputs must not be null");
        }
-       final StringBuilder longestCommonSubstringArray = new 
StringBuilder(Math.max(left.length(), right.length()));
-       final int[][] lcsLengthArray = longestCommonSubstringLengthArray(left, 
right);
-       int i = left.length() - 1;
-       int j = right.length() - 1;
-       int k = lcsLengthArray[left.length()][right.length()] - 1;
-       while (k >= 0) {
-           if (left.charAt(i) == right.charAt(j)) {
-               longestCommonSubstringArray.append(left.charAt(i));
-               i = i - 1;
-               j = j - 1;
-               k = k - 1;
-           } else if (lcsLengthArray[i + 1][j] < lcsLengthArray[i][j + 1]) {
-               i = i - 1;
-           } else {
-               j = j - 1;
-           }
+       // Find lengths of two strings
+       final int leftSz = left.length();
+       final int rightSz = right.length();
+
+       // Check if we can save even more space
+       if (leftSz < rightSz) {
+           return algorithmC(right, rightSz, left, leftSz);
        }
-       return longestCommonSubstringArray.reverse().toString();
+       return algorithmC(left, leftSz, right, rightSz);
+   }
+
+    /**
+     * An implementation of "ALG C" from Hirschberg's paper <a 
href="https://dl.acm.org/doi/10.1145/360825.360861";>A linear space algorithm 
for computing maximal common subsequences</a>.
+     * Assuming the sequence <code>left</code> is of size <code>m</code> and 
the sequence <code>right</code> is of size <code>n</code>,
+     * this method returns Longest Common Subsequence (LCS) the two sequences.
+     * As per the paper, this method runs in O(m * n) time and O(m + n) space.
+     *
+     * @param left Left sequence
+     * @param m Length of left sequence
+     * @param right Right sequence
+     * @param n Length of right sequence
+     * @return LCS of <code>left</code> and <code>right</code>
+     */
+   static CharSequence algorithmC(final CharSequence left, final int m,

Review comment:
       Hello!
   
   > We are almost there. Just wanted to check a few more things before digging 
into algorithm C and finish reading the paper and code here.
   
   I just wanted to let you know that I have made some refactoring.
   Instead of just passing string sequences to `algorithmB` and `algorithmC`, I 
have modified the methods to take the strings with their start and end index.
   I also added an extra parameter to `algorithmB` so that we can traverse the 
string backward instead of literally reversing the input strings (in previous 
version of `algorithmC` we were reversing the arguments in one of the calls to 
`algorithmB`).
   This refactoring should certainly increase the _dexterity_ of the 
implementation as we avoid several object creation, string reversal, and 
virtual method calls (a JMH might be needed anyway, as other maintainers also 
mentioned).
   Besides that, I think this refactor was, in fact, necessary to make the 
implementation as similar to the original algorithm as possible: the proof for 
space analysis of Algorithm C says "_We assume [...] and substrings can be 
transferred as arguments by giving initial and final locations_."
   I hope with this change, I have not interrupted your review process! Please 
let me know if adding more comments helps.
   
   > I think I prefer to keep a single algorithm for the sake of 
maintainability. And since C performs better, I'd keep just C.
   > What do you think?
   
   I think we cannot get rid of Algorithm B, as it is being used as an 
auxiliary procedure in implementing Algorithm C: please see step 3 of Algorithm 
C in the paper.
   But, I understand the maintainability concern. 
   How about doing this: we could keep **only** Algorithm B in this version to 
improve `apply` method only; we can keep the old implementation (dynamic 
programming) of the method `longestCommonSubsequence` for now.
   We can mention that improving space complexity of `longestCommonSubsequence` 
is a work in progress.
   
   But if we want to keep Algorithm C, we must keep Algorithm B as it is being 
called from Algorithm C.
   And in that case (in my opinion) it is ideal to use Algorithm B for `apply` 
method, because we already have the implementation and it involves less 
computation than Algorithm C.
   
   Please let me know if you prefer the former approach. In that case I can 
revert the code for `longestCommonSubsequence`.
   
   Thanks a lot!




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@commons.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

[GitHub] [commons-text] ali-ghanbari commented on a change in pull request #233: A More Efficient Implementation for Calculating Size of Longest Common Subsequence

Reply via email to