[GitHub] [commons-text] kinow commented on a change in pull request #233: [TEXT-212] A More Efficient Implementation for Calculating Size of Longest Common Subsequence

2021-08-05 Thread GitBox


kinow commented on a change in pull request #233:
URL: https://github.com/apache/commons-text/pull/233#discussion_r683855155



##
File path: 
src/main/java/org/apache/commons/text/similarity/LongestCommonSubsequence.java
##
@@ -29,36 +29,109 @@
  * 
  *
  * 
- * This implementation is based on the Longest Commons Substring algorithm
- * from https://en.wikipedia.org/wiki/Longest_common_subsequence_problem;>
- * https://en.wikipedia.org/wiki/Longest_common_subsequence_problem.
+ * As of version 1.10, a more space-efficient of the algorithm is implemented. 
The new algorithm has linear space
+ * complexity instead of quadratic. However, time complexity is still 
quadratic in the size of input strings.
  * 
  *
- * For further reading see:
+ * 
+ * The implementation is based on Hirschberg's Longest Commons Substring 
algorithm (cited below).
+ * 
  *
- * Lothaire, M. Applied combinatorics on words. New York: Cambridge 
U Press, 2005. 12-13
+ * For further reading see:
+ * 
+ * 
+ * Lothaire, M. Applied combinatorics on words. New York: Cambridge U 
Press, 2005. 12-13
+ * 
+ * 
+ * D. S. Hirschberg, "A linear space algorithm for computing maximal common 
subsequences," CACM, 1975, pp. 341--343.
+ * 
+ * 
  *
  * @since 1.0
  */
 public class LongestCommonSubsequence implements SimilarityScore {
-
 /**
- * Calculates longest common subsequence similarity score of two {@code 
CharSequence}'s passed as
+ * Calculates the longest common subsequence similarity score of two 
{@code CharSequence}'s passed as
  * input.
  *
- * @param left first character sequence
- * @param right second character sequence
- * @return longestCommonSubsequenceLength
- * @throws IllegalArgumentException
- * if either String input {@code null}
+ * 
+ * This method implements a more efficient version of LCS algorithm which 
has quadratic time and
+ * linear space complexity.
+ * 
+ *
+ * 
+ * This method is based on newly implemented {@link 
#algorithmB(CharSequence, CharSequence)}.
+ * An evaluation using JMH revealed that this method is almost two times 
faster than its previous version.
+ * 
+ *
+ * @param left First character sequence
+ * @param right Second character sequence
+ * @return Length of the longest common subsequence of left 
and right
+ * @throws IllegalArgumentException if either String input {@code null}
  */
 @Override
 public Integer apply(final CharSequence left, final CharSequence right) {
 // Quick return for invalid inputs
 if (left == null || right == null) {
 throw new IllegalArgumentException("Inputs must not be null");
 }
-return longestCommonSubsequence(left, right).length();
+// Find lengths of two strings
+final int leftSz = left.length();
+final int rightSz = right.length();
+
+// Check if we can avoid calling algorithmB which involves heap space 
allocation
+if (leftSz == 0 || rightSz == 0) {
+return 0;
+}
+
+// Check if we can save even more space
+if (leftSz < rightSz) {
+return algorithmB(right, left)[leftSz];
+}
+return algorithmB(left, right)[rightSz];
+}
+
+/**
+ * An implementation of "ALG B" from Hirschberg's CACM '71 paper.
+ * Assuming the first input sequence is of size m and the 
second input sequence is of size
+ * n, this method returns the last row of the dynamic 
programming (DP) table when calculating
+ * the LCS of the two sequences in O(m*n) time and O(n) 
space.

Review comment:
   Although `algorithmC` does similar trick, but in the javadocs we say 
`O(m+n) space`




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@commons.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [commons-text] kinow commented on a change in pull request #233: [TEXT-212] A More Efficient Implementation for Calculating Size of Longest Common Subsequence

2021-08-05 Thread GitBox


kinow commented on a change in pull request #233:
URL: https://github.com/apache/commons-text/pull/233#discussion_r683853652



##
File path: 
src/main/java/org/apache/commons/text/similarity/LongestCommonSubsequence.java
##
@@ -29,36 +29,109 @@
  * 
  *
  * 
- * This implementation is based on the Longest Commons Substring algorithm
- * from https://en.wikipedia.org/wiki/Longest_common_subsequence_problem;>
- * https://en.wikipedia.org/wiki/Longest_common_subsequence_problem.
+ * As of version 1.10, a more space-efficient of the algorithm is implemented. 
The new algorithm has linear space
+ * complexity instead of quadratic. However, time complexity is still 
quadratic in the size of input strings.
  * 
  *
- * For further reading see:
+ * 
+ * The implementation is based on Hirschberg's Longest Commons Substring 
algorithm (cited below).
+ * 
  *
- * Lothaire, M. Applied combinatorics on words. New York: Cambridge 
U Press, 2005. 12-13
+ * For further reading see:
+ * 
+ * 
+ * Lothaire, M. Applied combinatorics on words. New York: Cambridge U 
Press, 2005. 12-13
+ * 
+ * 
+ * D. S. Hirschberg, "A linear space algorithm for computing maximal common 
subsequences," CACM, 1975, pp. 341--343.
+ * 
+ * 
  *
  * @since 1.0
  */
 public class LongestCommonSubsequence implements SimilarityScore {
-
 /**
- * Calculates longest common subsequence similarity score of two {@code 
CharSequence}'s passed as
+ * Calculates the longest common subsequence similarity score of two 
{@code CharSequence}'s passed as
  * input.
  *
- * @param left first character sequence
- * @param right second character sequence
- * @return longestCommonSubsequenceLength
- * @throws IllegalArgumentException
- * if either String input {@code null}
+ * 
+ * This method implements a more efficient version of LCS algorithm which 
has quadratic time and
+ * linear space complexity.
+ * 
+ *
+ * 
+ * This method is based on newly implemented {@link 
#algorithmB(CharSequence, CharSequence)}.
+ * An evaluation using JMH revealed that this method is almost two times 
faster than its previous version.
+ * 
+ *
+ * @param left First character sequence
+ * @param right Second character sequence
+ * @return Length of the longest common subsequence of left 
and right
+ * @throws IllegalArgumentException if either String input {@code null}
  */
 @Override
 public Integer apply(final CharSequence left, final CharSequence right) {
 // Quick return for invalid inputs
 if (left == null || right == null) {
 throw new IllegalArgumentException("Inputs must not be null");
 }
-return longestCommonSubsequence(left, right).length();
+// Find lengths of two strings
+final int leftSz = left.length();
+final int rightSz = right.length();
+
+// Check if we can avoid calling algorithmB which involves heap space 
allocation
+if (leftSz == 0 || rightSz == 0) {
+return 0;
+}
+
+// Check if we can save even more space
+if (leftSz < rightSz) {
+return algorithmB(right, left)[leftSz];
+}
+return algorithmB(left, right)[rightSz];
+}
+
+/**
+ * An implementation of "ALG B" from Hirschberg's CACM '71 paper.
+ * Assuming the first input sequence is of size m and the 
second input sequence is of size
+ * n, this method returns the last row of the dynamic 
programming (DP) table when calculating
+ * the LCS of the two sequences in O(m*n) time and O(n) 
space.

Review comment:
   Oh, I think because you are always passing the smallest as the second 
sequence, then it'll be `O(n)`?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@commons.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [commons-text] kinow commented on a change in pull request #233: [TEXT-212] A More Efficient Implementation for Calculating Size of Longest Common Subsequence

2021-08-05 Thread GitBox


kinow commented on a change in pull request #233:
URL: https://github.com/apache/commons-text/pull/233#discussion_r683851199



##
File path: 
src/main/java/org/apache/commons/text/similarity/LongestCommonSubsequence.java
##
@@ -29,36 +29,109 @@
  * 
  *
  * 
- * This implementation is based on the Longest Commons Substring algorithm
- * from https://en.wikipedia.org/wiki/Longest_common_subsequence_problem;>
- * https://en.wikipedia.org/wiki/Longest_common_subsequence_problem.
+ * As of version 1.10, a more space-efficient of the algorithm is implemented. 
The new algorithm has linear space
+ * complexity instead of quadratic. However, time complexity is still 
quadratic in the size of input strings.
  * 
  *
- * For further reading see:
+ * 
+ * The implementation is based on Hirschberg's Longest Commons Substring 
algorithm (cited below).
+ * 
  *
- * Lothaire, M. Applied combinatorics on words. New York: Cambridge 
U Press, 2005. 12-13
+ * For further reading see:
+ * 
+ * 
+ * Lothaire, M. Applied combinatorics on words. New York: Cambridge U 
Press, 2005. 12-13
+ * 
+ * 
+ * D. S. Hirschberg, "A linear space algorithm for computing maximal common 
subsequences," CACM, 1975, pp. 341--343.
+ * 
+ * 
  *
  * @since 1.0
  */
 public class LongestCommonSubsequence implements SimilarityScore {
-
 /**
- * Calculates longest common subsequence similarity score of two {@code 
CharSequence}'s passed as
+ * Calculates the longest common subsequence similarity score of two 
{@code CharSequence}'s passed as
  * input.
  *
- * @param left first character sequence
- * @param right second character sequence
- * @return longestCommonSubsequenceLength
- * @throws IllegalArgumentException
- * if either String input {@code null}
+ * 
+ * This method implements a more efficient version of LCS algorithm which 
has quadratic time and
+ * linear space complexity.
+ * 
+ *
+ * 
+ * This method is based on newly implemented {@link 
#algorithmB(CharSequence, CharSequence)}.
+ * An evaluation using JMH revealed that this method is almost two times 
faster than its previous version.
+ * 
+ *
+ * @param left First character sequence
+ * @param right Second character sequence
+ * @return Length of the longest common subsequence of left 
and right
+ * @throws IllegalArgumentException if either String input {@code null}
  */
 @Override
 public Integer apply(final CharSequence left, final CharSequence right) {
 // Quick return for invalid inputs
 if (left == null || right == null) {
 throw new IllegalArgumentException("Inputs must not be null");
 }
-return longestCommonSubsequence(left, right).length();
+// Find lengths of two strings
+final int leftSz = left.length();
+final int rightSz = right.length();
+
+// Check if we can avoid calling algorithmB which involves heap space 
allocation
+if (leftSz == 0 || rightSz == 0) {
+return 0;
+}
+
+// Check if we can save even more space
+if (leftSz < rightSz) {
+return algorithmB(right, left)[leftSz];
+}
+return algorithmB(left, right)[rightSz];
+}
+
+/**
+ * An implementation of "ALG B" from Hirschberg's CACM '71 paper.
+ * Assuming the first input sequence is of size m and the 
second input sequence is of size
+ * n, this method returns the last row of the dynamic 
programming (DP) table when calculating
+ * the LCS of the two sequences in O(m*n) time and O(n) 
space.

Review comment:
   @ali-ghanbari 
   
   >and O(n) space
   
   Isn't that `O(m+n)`?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@commons.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [commons-text] kinow commented on a change in pull request #233: [TEXT-212] A More Efficient Implementation for Calculating Size of Longest Common Subsequence

2021-08-05 Thread GitBox


kinow commented on a change in pull request #233:
URL: https://github.com/apache/commons-text/pull/233#discussion_r683849914



##
File path: 
src/test/java/org/apache/commons/text/jmh/LongestCommonSubsequencePerformance.java
##
@@ -0,0 +1,166 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.commons.text.jmh;
+
+import org.apache.commons.lang3.tuple.ImmutablePair;
+import org.apache.commons.lang3.tuple.Pair;
+import org.apache.commons.text.similarity.LongestCommonSubsequence;
+import org.apache.commons.text.similarity.SimilarityScore;
+import org.openjdk.jmh.annotations.Benchmark;
+import org.openjdk.jmh.annotations.BenchmarkMode;
+import org.openjdk.jmh.annotations.Fork;
+import org.openjdk.jmh.annotations.Level;
+import org.openjdk.jmh.annotations.Measurement;
+import org.openjdk.jmh.annotations.Mode;
+import org.openjdk.jmh.annotations.OutputTimeUnit;
+import org.openjdk.jmh.annotations.Scope;
+import org.openjdk.jmh.annotations.Setup;
+import org.openjdk.jmh.annotations.State;
+import org.openjdk.jmh.annotations.Warmup;
+
+import java.io.BufferedReader;
+import java.io.IOException;
+import java.io.InputStream;
+import java.io.InputStreamReader;
+import java.util.ArrayList;
+import java.util.List;
+import java.util.Objects;
+import java.util.concurrent.TimeUnit;
+
+/**
+ * Performance analysis for LongestCommonSubsequence
+ */
+@BenchmarkMode(Mode.AverageTime)
+@OutputTimeUnit(TimeUnit.MILLISECONDS)
+@Warmup(iterations = 5, time = 1)
+@Measurement(iterations = 5, time = 1)
+@Fork(value = 1, jvmArgs = {"-server", "-Xms512M", "-Xmx512M"})
+public class LongestCommonSubsequencePerformance {
+@State(Scope.Benchmark)
+public static class InputData {
+final List> inputs = new 
ArrayList<>();
+
+@Setup(Level.Trial)
+public void setup() {
+final ClassLoader classloader = 
Thread.currentThread().getContextClassLoader();
+try (InputStream is = 
classloader.getResourceAsStream("lcs-perf-analysis-inputs.csv");
+ InputStreamReader isr = new 
InputStreamReader(Objects.requireNonNull(is));
+ BufferedReader br = new BufferedReader(isr)) {
+String line;
+while ((line = br.readLine()) != null && 
!line.trim().isEmpty()) {
+line = line.trim();

Review comment:
   Checkstyle was complaining about re-assigning `line` in the `while` 
block, so I left the double `line.trim` call, which shouldn't really interfere 
with the performance tests :+1: 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@commons.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [commons-text] kinow commented on a change in pull request #233: [TEXT-212] A More Efficient Implementation for Calculating Size of Longest Common Subsequence

2021-08-05 Thread GitBox


kinow commented on a change in pull request #233:
URL: https://github.com/apache/commons-text/pull/233#discussion_r683839207



##
File path: src/test/resources/lcs-perf-analysis-inputs.csv
##
@@ -0,0 +1,5 @@
+"This code is free software; you can redistribute it and/or modify it","under 
the terms of the GNU General Public License version 2 only, as"
+"You should have received a copy of the GNU General Public License version","2 
along with this work; if not, write to the Free Software Foundation,"
+"Here, the field iterations will be populated with appropriate values from the 
@Param annotation by the JMH when it is passed to the benchmark method. The 
@Setup annotated method is invoked before each invocation of the benchmark and 
creates a new Hasher ensuring isolation. When the execution is finished, we'll 
get a result similar to the one below: When running microbenchmarks, it's very 
important to be aware of optimizations. Otherwise, they may affect 
the","benchmark results in a very misleading way. To make matters a bit more 
concrete, let's consider an example: We expect object allocation costs more 
than doing nothing at all. However, if we run the benchmarks: Apparently 
finding a place in the TLAB, creating and initializing an object is almost 
free! Just by looking at these numbers, we should know that something does not 
quite add up here."
+"Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod 
tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, 
quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo 
consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse 
cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non 
proident, sunt in culpa qui officia deserunt mollit anim id est laborum. This 
code is free software; you can redistribute it and/or modify it published by 
the Free Software Foundation.  Oracle designates this This code is distributed 
in the hope that it will be useful, but WITHOUT FITNESS FOR A PARTICULAR 
PURPOSE.  See the GNU General Public License You should have received a copy of 
the GNU General Public License version","Sed ut perspiciatis unde omnis iste 
natus error sit voluptatem accusantium doloremque laudantium, totam rem 
aperiam, eaque ipsa quae ab illo inventore veritatis et quasi architecto beatae 
vitae d
 icta sunt explicabo. Nemo enim ipsam voluptatem quia voluptas sit aspernatur 
aut odit aut fugit, sed quia consequuntur magni dolores eos qui ratione 
voluptatem sequi nesciunt. Neque porro quisquam est, qui dolorem ipsum quia 
dolor sit amet, consectetur, adipisci velit, sed quia non numquam eius modi 
tempora incidunt ut labore et dolore magnam aliquam quaerat voluptatem. Ut enim 
ad minima veniam, quis nostrum exercitationem ullam corporis suscipit 
laboriosam, nisi ut aliquid ex ea commodi consequatur? Quis autem vel eum iure 
reprehenderit qui in ea voluptate velit esse quam nihil molestiae consequatur, 
vel illum qui dolorem eum fugiat quo voluptas nulla pariatur? under the terms 
of the GNU General Public License version 2 only, as particular file as subject 
to the *Classpath* exception as provided ANY WARRANTY; without even the implied 
warranty of MERCHANTABILITY or version 2 for more details (a copy is included 
in the LICENSE file that 2 along with this work; if not, write to the Fr
 ee Software Foundation,"
+"But I must explain to you how all this mistaken idea of denouncing pleasure 
and praising pain was born and I will give you a complete account of the 
system, and expound the actual teachings of the great explorer of the truth, 
the master-builder of human happiness. No one rejects, dislikes, or avoids 
pleasure itself, because it is pleasure, but because those who do not know how 
to pursue pleasure rationally encounter consequences that are extremely 
painful. Nor again is there anyone who loves or pursues or desires to obtain 
pain of itself, because it is pain, but because occasionally circumstances 
occur in which toil and pain can procure him some great pleasure. To take a 
trivial example, which of us ever undertakes laborious physical exercise, 
except to obtain some advantage from it? But who has any right to find fault 
with a man who chooses to enjoy a pleasure that has no annoying consequences, 
or one who avoids a pain that produces no resultant pleasure? Here, the field 
iteration
 s will be populated with appropriate values from the @Param annotation by the 
JMH when it is passed to the benchmark method. The @Setup annotated method is 
invoked before each invocation of the benchmark and creates a new Hasher 
ensuring isolation. When the execution is finished, we'll get a result similar 
to the one below: When running microbenchmarks, it's very important to be aware 
of optimizations. Otherwise, they may affect the","At vero eos et accusamus et 
iusto odio dignissimos ducimus qui blanditiis praesentium voluptatum