Re: RFR(JDK 12/NIO) 8202285: (fs) Add a method to Files for comparing file contents

Joe Wang Fri, 21 Sep 2018 20:50:23 -0700


On 9/20/18, 2:46 PM, Stuart Marks wrote:

On 9/19/18 11:48 AM, Joe Wang wrote:
After much discussion and 10 iterations of reviews, this proposal hasevolved from what was the original isSameContent method to a mismatchmethod. API-wise, a compare method was also considered as it lookedlike just a short step forward from mismatch, however, it waseventually dropped since there is no convincing use case comparingfiles lexicographically by contents. Impl-wise, extensive performancebenchmarking has been done to compare a buffered reading vs memorymapping, the result was that a simple buffered reading performedbetter among small files, and those with the mismatched byte closerto the beginning of files. Since the proposed method's targeted filesare small ones, the impl currently does a buffered reading only.
Hi Joe,

Thanks for being persistent with this one!

Thanks for the help, and the "lot of stuff" with clear indicators (e.g.line numbers and etc.)! From JDK 11 to 12, hopefully wont be deferredto 13 :-)

A very small spec nitpick:
specdiff:http://cr.openjdk.java.net/~joehw/jdk12/8202285/specdiff/java/nio/file/Files.html
1544 * <li> {@link #isSameFile(Path, Path) isSameFile(path,path2)} returns true,</li>
I would add "or" after the trailing comma. This makes it similar tothe two-item list that follows.


Done.

webrev: http://cr.openjdk.java.net/~joehw/jdk12/8202285/webrev/
A couple minor comments on FilesMismatch.java:
If mismatchByAttrs() is replaced with a call to isSameFile(), as Alansuggested, this simplifies things considerably. It looks like themismatch() implementation will be reduced to around ~40 lines of code.Does it deserve its own file, or can it be placed directly intoFiles.java? That file has over 4,000 lines already though.

I merged the private methods after removing mismatchByAttrs(), thatreduced the code to 33 lines, and then with the change you suggestedbelow, it's further reduced to 29 lines. I'll put them in Files for now.

Files is too big, ~180K, that compares with the next one ~45K amongfiles in the nio/file package. Unfortunately, the great majority of themare specs, probably 90%.

 106             if (nRead1 == 0 || nRead2 == 0) {
 107                 if (nRead1 == 0 && nRead2 == 0) {
 108                     // both files reach EOF
 109                     return -1;
 110                 } else {
 111                     // one of the files reaches EOF
 112                     return totalRead;
 113                 }
 114             } else if (nRead1 != nRead2) {
I think it reads slightly better if the nested 'if' at lines 107-113were flattened into the else-if chain, like so:
            if (nRead1 == 0 && nRead2 == 0) {
                // both files reached EOF
                return -1;
            } else if (nRead1 == 0 || nRead2 == 0) {
                // one but not both files reached EOF
                return totalRead;
            } else if (nRead1 != nRead2) {
There are a couple places where lines can be joined. The resultinglines are longer than 80 characters but are still less than 100characters, which I think is OK. (I don't think we have a strict limitof 80 anymore.)
  97     private static long mismatch(InputStream in1, InputStream in2)
  98         throws IOException {

 117                 return totalRead +
118 Arrays.mismatch(buffer1, 0, len, buffer2, 0,len);


Done. 95 characters :-)

Comments on tests in Comparison.java:
* This file, and several names within this file, should probablyshould be renamed to focus on mismatch instead of comparison.


Will rename to "mismatch". Previously, it covers both mismatch and compare.

* I'm finding it quite difficult to read the test cases.
The test files are placed into a Path[] and referenced by index. Thismakes it quite difficult to determine, for example, if all six caseswithin the mismatch loop are covered.
Instead of storing the files at specific array indexes, I'd suggestdoing the following.
1) Establish good naming conventions for test files, including sizeinformation, covering duplicate files (e.g., 'a' and 'b'), and amodification at a particular offset. For example, you might dosomething like the following:
    test120a    - a test file with size 120
    test120b    - a test file with size 120, identical to test120a
    test120m110 - a test file with size 120, with an altered byte at 110
    etc.

I can do that. I had hoped the Assertion msg, e.g. "Both files areempty, no mismatch" would help. Along with renaming variables, I willalso add to the Assertion Msg with test case number, e.g. "Case#1: bothfiles are empty, return no mismatch".

2) Declare fields of the same name containing the file paths.
3) Add some logic in prepareTestFile() to append each file's path to alist after the file is created. Then have the @AfterClass method runthrough this list and delete the files.


May also create the test files in a folder and delete it @AfterClass.

4) The test data generator method can then have lines of test datathat reference the test files using a meaningful name, so it's easy tounderstand the names and the expected results. Also, the test datagenerator should have a comment that describes the test methodparameters.


Sure, will add comments and etc.

* The generation of the test data files seems overly complicated.
A StringBuilder is used to accumulate data, then it's turned into aString, then (optionally) a character is replaced by the insertChar()method, and then the string is written to the file.
I think you can just accumulate the test data into a StringBuilder,(optionally) call sb.setCharAt() to modify the data, and then writethe SB to the file. After all, Files.writeString() takes a CharSequence.
The generateFixedString() method should be changed to create andreturn a fresh StringBuilder with the desired contents. (See above.)It can also be simplified quite a bit.

Wouldn't it be nice if String has an utility method that generatesrandom ASCII strings? ;-) It can be faster with its internal byte array.

I don't think you need to have methods with Charset parameters. It'sreasonable to have the test files be text, which makes things easierto develop and debug. But the test depends on characters being writtenin a single-byte charset, because mismatch() reports *byte* positionsand not *character* positions.
Using the method overloads that are implicitly UTF-8 is incorrect. Ifa non-single-byte character were to slip into the test data (seemsunlikely, but possible), it'll be expanded to multiple bytes by theUTF-8 encoder, which may throw off the result. Instead I'd suggesthardwiring the charset to be US_ASCII to enforce theone-byte-per-character invariant.
This is a lot of stuff. If you can't follow it all, perhaps we canwork together directly and go over the test file.

I'll improve the test, and US_ASCII it is! I'll probably use a string offixed length to generate the longest string the test needs, and then getsubstrings for other test cases, that may simplify the process.


Thanks,
Joe


Thanks,

s'marks

Re: RFR(JDK 12/NIO) 8202285: (fs) Add a method to Files for comparing file contents

Reply via email to