Re: RFR(JDK 12/NIO) 8202285: (fs) Add a method to Files for comparing file contents

Joe Wang Mon, 15 Oct 2018 15:16:07 -0700



On 10/14/18, 6:48 PM, Weijun Wang wrote:

Hi Joe

Two comments:

1. How about (path1, path2)? I take a look at other similar APIs, some use 
(c1,c2) and some (a,b).

It's true we have some inconsistencies there. Arrays.mismatch forexample, named the parameters (a,b). Arguably though, they should havebeen (a, a2) if they wanted to maintain consistency with then existing"equals" methods. In our case, since Files.isSameFile was (path, path2),Files.mismatch(path, path2) is desirable since it is extending theformer's functionality.


2. Could the method be non-reflexive even if fs is non static? isSameFile(f,f) 
is always true.

+     *<p>  This method may not be atomic with respect to other file system
+     * operations. If the file system and files remain static then this method
+     * is<i>reflexive</i>  (for {@code Path} {@code f}, {@code mismatch(f,f)}
+     * returns {@code -1L})

You're right. The javadoc is updated now with the If statement moved tocover the symmetric case only.


Current:

specdiff:http://cr.openjdk.java.net/~joehw/jdk12/8202285/specdiff_v03/java/nio/file/Files.html

webrev: http://cr.openjdk.java.net/~joehw/jdk12/8202285/webrev_v03/

previous:

specdiff:http://cr.openjdk.java.net/~joehw/jdk12/8202285/specdiff_02/java/nio/file/Files.html

webrev: http://cr.openjdk.java.net/~joehw/jdk12/8202285/webrev_v02/

Thanks,
Joe


Thanks
Max

On Oct 13, 2018, at 3:16 AM, Joe Wang<huizhe.w...@oracle.com>  wrote:

Hi all,

Here's an update based on all of the great reviews and comments (thanks all!):

JBS: https://bugs.openjdk.java.net/browse/JDK-8202285
CSR: https://bugs.openjdk.java.net/browse/JDK-8202302

Current version:
specdiff: 
http://cr.openjdk.java.net/~joehw/jdk12/8202285/specdiff_02/java/nio/file/Files.html
webrevs: http://cr.openjdk.java.net/~joehw/jdk12/8202285/webrev_v02/

Previous version:
specdiff: 
http://cr.openjdk.java.net/~joehw/jdk12/8202285/specdiff_v01/java/nio/file/Files.html
webrevs: http://cr.openjdk.java.net/~joehw/jdk12/8202285/webrev_v01/

It's been a while, so here's a summary of what's changed:

Spec: Alan's patch to fix inconsistencies in wording/terminology
Impl: s/mismatchByAttrs and etc/isSameFile;
         Stick with the simple implementation (InputStream/Buffer) for now. 
Further improvement may be done if there's a need;
         The impl is smaller than the previous version, merged the code into 
Files, removed FilesMismatch.java;

Test: more readable variables instead of the array of test files
         more comments on the testcases
         improved the private methods used for generating the test files

Thanks,
Joe

On 9/21/18, 6:49 PM, Joe Wang wrote:

On 9/20/18, 2:46 PM, Stuart Marks wrote:


On 9/19/18 11:48 AM, Joe Wang wrote:

After much discussion and 10 iterations of reviews, this proposal has evolved 
from what was the original isSameContent method to a mismatch method. API-wise, 
a compare method was also considered as it looked like just a short step 
forward from mismatch, however, it was eventually dropped since there is no 
convincing use case comparing files lexicographically by contents. Impl-wise, 
extensive performance benchmarking has been done to compare a buffered reading 
vs memory mapping, the result was that a simple buffered reading performed 
better among small files, and those with the mismatched byte closer to the 
beginning of files. Since the proposed method's targeted files are small ones, 
the impl currently does a buffered reading only.

Hi Joe,

Thanks for being persistent with this one!

Thanks for the help, and the "lot of stuff" with clear indicators (e.g. line 
numbers and etc.)!  From JDK 11 to 12, hopefully wont be deferred to 13 :-)

A very small spec nitpick:

specdiff: 
http://cr.openjdk.java.net/~joehw/jdk12/8202285/specdiff/java/nio/file/Files.html

1544      *<li>  {@link #isSameFile(Path, Path) isSameFile(path, path2)} returns 
true,</li>

I would add "or" after the trailing comma. This makes it similar to the 
two-item list that follows.

Done.

webrev: http://cr.openjdk.java.net/~joehw/jdk12/8202285/webrev/

A couple minor comments on FilesMismatch.java:

If mismatchByAttrs() is replaced with a call to isSameFile(), as Alan 
suggested, this simplifies things considerably. It looks like the mismatch() 
implementation will be reduced to around ~40 lines of code. Does it deserve its 
own file, or can it be placed directly into Files.java? That file has over 
4,000 lines already though.

I merged the private methods after removing mismatchByAttrs(), that reduced the 
code to 33 lines, and then with the change you suggested below, it's further 
reduced to 29 lines. I'll put them in Files for now.

Files is too big,  ~180K, that compares with the next one ~45K among files in 
the nio/file package. Unfortunately, the great majority of them are specs, 
probably 90%.

106             if (nRead1 == 0 || nRead2 == 0) {
107                 if (nRead1 == 0&&  nRead2 == 0) {
108                     // both files reach EOF
109                     return -1;
110                 } else {
111                     // one of the files reaches EOF
112                     return totalRead;
113                 }
114             } else if (nRead1 != nRead2) {

I think it reads slightly better if the nested 'if' at lines 107-113 were 
flattened into the else-if chain, like so:

            if (nRead1 == 0&&  nRead2 == 0) {
                // both files reached EOF
                return -1;
            } else if (nRead1 == 0 || nRead2 == 0) {
                // one but not both files reached EOF
                return totalRead;
            } else if (nRead1 != nRead2) {

There are a couple places where lines can be joined. The resulting lines are 
longer than 80 characters but are still less than 100 characters, which I think 
is OK. (I don't think we have a strict limit of 80 anymore.)

  97     private static long mismatch(InputStream in1, InputStream in2)
  98         throws IOException {

117                 return totalRead +
118                     Arrays.mismatch(buffer1, 0, len, buffer2, 0, len);

Done. 95 characters :-)

Comments on tests in Comparison.java:

* This file, and several names within this file, should probably should be 
renamed to focus on mismatch instead of comparison.

Will rename to "mismatch". Previously, it covers both mismatch and compare.

* I'm finding it quite difficult to read the test cases.

The test files are placed into a Path[] and referenced by index. This makes it 
quite difficult to determine, for example, if all six cases within the mismatch 
loop are covered.

Instead of storing the files at specific array indexes, I'd suggest doing the 
following.

1) Establish good naming conventions for test files, including size 
information, covering duplicate files (e.g., 'a' and 'b'), and a modification 
at a particular offset. For example, you might do something like the following:

    test120a    - a test file with size 120
    test120b    - a test file with size 120, identical to test120a
    test120m110 - a test file with size 120, with an altered byte at 110
    etc.

I can do that. I had hoped the Assertion msg, e.g. "Both files are empty, no mismatch" 
would help. Along with renaming variables, I will also add to the Assertion Msg with test case 
number, e.g. "Case#1: both files are empty, return no mismatch".

2) Declare fields of the same name containing the file paths.

3) Add some logic in prepareTestFile() to append each file's path to a list 
after the file is created. Then have the @AfterClass method run through this 
list and delete the files.

May also create the test files in a folder and delete it @AfterClass.

4) The test data generator method can then have lines of test data that 
reference the test files using a meaningful name, so it's easy to understand 
the names and the expected results. Also, the test data generator should have a 
comment that describes the test method parameters.

Sure, will add comments and etc.

* The generation of the test data files seems overly complicated.

A StringBuilder is used to accumulate data, then it's turned into a String, 
then (optionally) a character is replaced by the insertChar() method, and then 
the string is written to the file.

I think you can just accumulate the test data into a StringBuilder, 
(optionally) call sb.setCharAt() to modify the data, and then write the SB to 
the file. After all, Files.writeString() takes a CharSequence.

The generateFixedString() method should be changed to create and return a fresh 
StringBuilder with the desired contents. (See above.) It can also be simplified 
quite a bit.

Wouldn't it be nice if String has an utility method that generates random ASCII 
strings? ;-)  It can be faster with its internal byte array.

I don't think you need to have methods with Charset parameters. It's reasonable 
to have the test files be text, which makes things easier to develop and debug. 
But the test depends on characters being written in a single-byte charset, 
because mismatch() reports *byte* positions and not *character* positions.

Using the method overloads that are implicitly UTF-8 is incorrect. If a 
non-single-byte character were to slip into the test data (seems unlikely, but 
possible), it'll be expanded to multiple bytes by the UTF-8 encoder, which may 
throw off the result. Instead I'd suggest hardwiring the charset to be US_ASCII 
to enforce the one-byte-per-character invariant.

This is a lot of stuff. If you can't follow it all, perhaps we can work 
together directly and go over the test file.

I'll improve the test, and US_ASCII it is! I'll probably use a string of fixed 
length to generate the longest string the test needs, and then get substrings 
for other test cases, that may simplify the process.

Thanks,
Joe

Thanks,

s'marks

Re: RFR(JDK 12/NIO) 8202285: (fs) Add a method to Files for comparing file contents

Reply via email to