Re: RFR(JDK 12/NIO) 8202285: (fs) Add a method to Files for comparing file contents

Weijun Wang Sun, 14 Oct 2018 18:50:39 -0700

Hi Joe

Two comments:


1. How about (path1, path2)? I take a look at other similar APIs, some use 
(c1,c2) and some (a,b).

2. Could the method be non-reflexive even if fs is non static? isSameFile(f,f) 
is always true.

+     * <p> This method may not be atomic with respect to other file system
+     * operations. If the file system and files remain static then this method
+     * is <i>reflexive</i> (for {@code Path} {@code f}, {@code mismatch(f,f)}
+     * returns {@code -1L})

Thanks
Max

> On Oct 13, 2018, at 3:16 AM, Joe Wang <huizhe.w...@oracle.com> wrote:
> 
> Hi all,
> 
> Here's an update based on all of the great reviews and comments (thanks all!):
> 
> JBS: https://bugs.openjdk.java.net/browse/JDK-8202285
> CSR: https://bugs.openjdk.java.net/browse/JDK-8202302
> 
> Current version:
> specdiff: 
> http://cr.openjdk.java.net/~joehw/jdk12/8202285/specdiff_02/java/nio/file/Files.html
> webrevs: http://cr.openjdk.java.net/~joehw/jdk12/8202285/webrev_v02/
> 
> Previous version:
> specdiff: 
> http://cr.openjdk.java.net/~joehw/jdk12/8202285/specdiff_v01/java/nio/file/Files.html
> webrevs: http://cr.openjdk.java.net/~joehw/jdk12/8202285/webrev_v01/
> 
> It's been a while, so here's a summary of what's changed:
> 
> Spec: Alan's patch to fix inconsistencies in wording/terminology
> Impl: s/mismatchByAttrs and etc/isSameFile;
>         Stick with the simple implementation (InputStream/Buffer) for now. 
> Further improvement may be done if there's a need;
>         The impl is smaller than the previous version, merged the code into 
> Files, removed FilesMismatch.java;
> 
> Test: more readable variables instead of the array of test files
>         more comments on the testcases
>         improved the private methods used for generating the test files
> 
> Thanks,
> Joe
> 
> On 9/21/18, 6:49 PM, Joe Wang wrote:
>> 
>> On 9/20/18, 2:46 PM, Stuart Marks wrote:
>>> 
>>> 
>>> On 9/19/18 11:48 AM, Joe Wang wrote:
>>>> After much discussion and 10 iterations of reviews, this proposal has 
>>>> evolved from what was the original isSameContent method to a mismatch 
>>>> method. API-wise, a compare method was also considered as it looked like 
>>>> just a short step forward from mismatch, however, it was eventually 
>>>> dropped since there is no convincing use case comparing files 
>>>> lexicographically by contents. Impl-wise, extensive performance 
>>>> benchmarking has been done to compare a buffered reading vs memory 
>>>> mapping, the result was that a simple buffered reading performed better 
>>>> among small files, and those with the mismatched byte closer to the 
>>>> beginning of files. Since the proposed method's targeted files are small 
>>>> ones, the impl currently does a buffered reading only.
>>> 
>>> Hi Joe,
>>> 
>>> Thanks for being persistent with this one!
>> 
>> Thanks for the help, and the "lot of stuff" with clear indicators (e.g. line 
>> numbers and etc.)!  From JDK 11 to 12, hopefully wont be deferred to 13 :-)
>>> 
>>> A very small spec nitpick:
>>> 
>>>> specdiff: 
>>>> http://cr.openjdk.java.net/~joehw/jdk12/8202285/specdiff/java/nio/file/Files.html
>>> 
>>> 1544      * <li> {@link #isSameFile(Path, Path) isSameFile(path, path2)} 
>>> returns true,</li>
>>> 
>>> I would add "or" after the trailing comma. This makes it similar to the 
>>> two-item list that follows.
>> 
>> Done.
>>> 
>>>> webrev: http://cr.openjdk.java.net/~joehw/jdk12/8202285/webrev/
>>> 
>>> A couple minor comments on FilesMismatch.java:
>>> 
>>> If mismatchByAttrs() is replaced with a call to isSameFile(), as Alan 
>>> suggested, this simplifies things considerably. It looks like the 
>>> mismatch() implementation will be reduced to around ~40 lines of code. Does 
>>> it deserve its own file, or can it be placed directly into Files.java? That 
>>> file has over 4,000 lines already though.
>> 
>> I merged the private methods after removing mismatchByAttrs(), that reduced 
>> the code to 33 lines, and then with the change you suggested below, it's 
>> further reduced to 29 lines. I'll put them in Files for now.
>> 
>> Files is too big,  ~180K, that compares with the next one ~45K among files 
>> in the nio/file package. Unfortunately, the great majority of them are 
>> specs, probably 90%.
>>> 
>>> 106             if (nRead1 == 0 || nRead2 == 0) {
>>> 107                 if (nRead1 == 0 && nRead2 == 0) {
>>> 108                     // both files reach EOF
>>> 109                     return -1;
>>> 110                 } else {
>>> 111                     // one of the files reaches EOF
>>> 112                     return totalRead;
>>> 113                 }
>>> 114             } else if (nRead1 != nRead2) {
>>> 
>>> I think it reads slightly better if the nested 'if' at lines 107-113 were 
>>> flattened into the else-if chain, like so:
>>> 
>>>            if (nRead1 == 0 && nRead2 == 0) {
>>>                // both files reached EOF
>>>                return -1;
>>>            } else if (nRead1 == 0 || nRead2 == 0) {
>>>                // one but not both files reached EOF
>>>                return totalRead;
>>>            } else if (nRead1 != nRead2) {
>>> 
>>> There are a couple places where lines can be joined. The resulting lines 
>>> are longer than 80 characters but are still less than 100 characters, which 
>>> I think is OK. (I don't think we have a strict limit of 80 anymore.)
>>> 
>>>  97     private static long mismatch(InputStream in1, InputStream in2)
>>>  98         throws IOException {
>>> 
>>> 117                 return totalRead +
>>> 118                     Arrays.mismatch(buffer1, 0, len, buffer2, 0, len);
>> 
>> Done. 95 characters :-)
>> 
>>> 
>>> Comments on tests in Comparison.java:
>>> 
>>> * This file, and several names within this file, should probably should be 
>>> renamed to focus on mismatch instead of comparison.
>> 
>> Will rename to "mismatch". Previously, it covers both mismatch and compare.
>>> 
>>> * I'm finding it quite difficult to read the test cases.
>>> 
>>> The test files are placed into a Path[] and referenced by index. This makes 
>>> it quite difficult to determine, for example, if all six cases within the 
>>> mismatch loop are covered.
>>> 
>>> Instead of storing the files at specific array indexes, I'd suggest doing 
>>> the following.
>>> 
>>> 1) Establish good naming conventions for test files, including size 
>>> information, covering duplicate files (e.g., 'a' and 'b'), and a 
>>> modification at a particular offset. For example, you might do something 
>>> like the following:
>>> 
>>>    test120a    - a test file with size 120
>>>    test120b    - a test file with size 120, identical to test120a
>>>    test120m110 - a test file with size 120, with an altered byte at 110
>>>    etc.
>> 
>> I can do that. I had hoped the Assertion msg, e.g. "Both files are empty, no 
>> mismatch" would help. Along with renaming variables, I will also add to the 
>> Assertion Msg with test case number, e.g. "Case#1: both files are empty, 
>> return no mismatch".
>>> 
>>> 2) Declare fields of the same name containing the file paths.
>>> 
>>> 3) Add some logic in prepareTestFile() to append each file's path to a list 
>>> after the file is created. Then have the @AfterClass method run through 
>>> this list and delete the files.
>> 
>> May also create the test files in a folder and delete it @AfterClass.
>>> 
>>> 4) The test data generator method can then have lines of test data that 
>>> reference the test files using a meaningful name, so it's easy to 
>>> understand the names and the expected results. Also, the test data 
>>> generator should have a comment that describes the test method parameters.
>> 
>> Sure, will add comments and etc.
>>> 
>>> * The generation of the test data files seems overly complicated.
>>> 
>>> A StringBuilder is used to accumulate data, then it's turned into a String, 
>>> then (optionally) a character is replaced by the insertChar() method, and 
>>> then the string is written to the file.
>>> 
>>> I think you can just accumulate the test data into a StringBuilder, 
>>> (optionally) call sb.setCharAt() to modify the data, and then write the SB 
>>> to the file. After all, Files.writeString() takes a CharSequence.
>>> 
>>> The generateFixedString() method should be changed to create and return a 
>>> fresh StringBuilder with the desired contents. (See above.) It can also be 
>>> simplified quite a bit.
>> 
>> Wouldn't it be nice if String has an utility method that generates random 
>> ASCII strings? ;-)  It can be faster with its internal byte array.
>>> 
>>> I don't think you need to have methods with Charset parameters. It's 
>>> reasonable to have the test files be text, which makes things easier to 
>>> develop and debug. But the test depends on characters being written in a 
>>> single-byte charset, because mismatch() reports *byte* positions and not 
>>> *character* positions.
>>> 
>>> Using the method overloads that are implicitly UTF-8 is incorrect. If a 
>>> non-single-byte character were to slip into the test data (seems unlikely, 
>>> but possible), it'll be expanded to multiple bytes by the UTF-8 encoder, 
>>> which may throw off the result. Instead I'd suggest hardwiring the charset 
>>> to be US_ASCII to enforce the one-byte-per-character invariant.
>>> 
>>> This is a lot of stuff. If you can't follow it all, perhaps we can work 
>>> together directly and go over the test file.
>> 
>> I'll improve the test, and US_ASCII it is! I'll probably use a string of 
>> fixed length to generate the longest string the test needs, and then get 
>> substrings for other test cases, that may simplify the process.
>> 
>> Thanks,
>> Joe
>> 
>>> 
>>> Thanks,
>>> 
>>> s'marks

Re: RFR(JDK 12/NIO) 8202285: (fs) Add a method to Files for comparing file contents

Reply via email to