Hi Joe Two comments:
1. How about (path1, path2)? I take a look at other similar APIs, some use (c1,c2) and some (a,b). 2. Could the method be non-reflexive even if fs is non static? isSameFile(f,f) is always true. + * <p> This method may not be atomic with respect to other file system + * operations. If the file system and files remain static then this method + * is <i>reflexive</i> (for {@code Path} {@code f}, {@code mismatch(f,f)} + * returns {@code -1L}) Thanks Max > On Oct 13, 2018, at 3:16 AM, Joe Wang <huizhe.w...@oracle.com> wrote: > > Hi all, > > Here's an update based on all of the great reviews and comments (thanks all!): > > JBS: https://bugs.openjdk.java.net/browse/JDK-8202285 > CSR: https://bugs.openjdk.java.net/browse/JDK-8202302 > > Current version: > specdiff: > http://cr.openjdk.java.net/~joehw/jdk12/8202285/specdiff_02/java/nio/file/Files.html > webrevs: http://cr.openjdk.java.net/~joehw/jdk12/8202285/webrev_v02/ > > Previous version: > specdiff: > http://cr.openjdk.java.net/~joehw/jdk12/8202285/specdiff_v01/java/nio/file/Files.html > webrevs: http://cr.openjdk.java.net/~joehw/jdk12/8202285/webrev_v01/ > > It's been a while, so here's a summary of what's changed: > > Spec: Alan's patch to fix inconsistencies in wording/terminology > Impl: s/mismatchByAttrs and etc/isSameFile; > Stick with the simple implementation (InputStream/Buffer) for now. > Further improvement may be done if there's a need; > The impl is smaller than the previous version, merged the code into > Files, removed FilesMismatch.java; > > Test: more readable variables instead of the array of test files > more comments on the testcases > improved the private methods used for generating the test files > > Thanks, > Joe > > On 9/21/18, 6:49 PM, Joe Wang wrote: >> >> On 9/20/18, 2:46 PM, Stuart Marks wrote: >>> >>> >>> On 9/19/18 11:48 AM, Joe Wang wrote: >>>> After much discussion and 10 iterations of reviews, this proposal has >>>> evolved from what was the original isSameContent method to a mismatch >>>> method. API-wise, a compare method was also considered as it looked like >>>> just a short step forward from mismatch, however, it was eventually >>>> dropped since there is no convincing use case comparing files >>>> lexicographically by contents. Impl-wise, extensive performance >>>> benchmarking has been done to compare a buffered reading vs memory >>>> mapping, the result was that a simple buffered reading performed better >>>> among small files, and those with the mismatched byte closer to the >>>> beginning of files. Since the proposed method's targeted files are small >>>> ones, the impl currently does a buffered reading only. >>> >>> Hi Joe, >>> >>> Thanks for being persistent with this one! >> >> Thanks for the help, and the "lot of stuff" with clear indicators (e.g. line >> numbers and etc.)! From JDK 11 to 12, hopefully wont be deferred to 13 :-) >>> >>> A very small spec nitpick: >>> >>>> specdiff: >>>> http://cr.openjdk.java.net/~joehw/jdk12/8202285/specdiff/java/nio/file/Files.html >>> >>> 1544 * <li> {@link #isSameFile(Path, Path) isSameFile(path, path2)} >>> returns true,</li> >>> >>> I would add "or" after the trailing comma. This makes it similar to the >>> two-item list that follows. >> >> Done. >>> >>>> webrev: http://cr.openjdk.java.net/~joehw/jdk12/8202285/webrev/ >>> >>> A couple minor comments on FilesMismatch.java: >>> >>> If mismatchByAttrs() is replaced with a call to isSameFile(), as Alan >>> suggested, this simplifies things considerably. It looks like the >>> mismatch() implementation will be reduced to around ~40 lines of code. Does >>> it deserve its own file, or can it be placed directly into Files.java? That >>> file has over 4,000 lines already though. >> >> I merged the private methods after removing mismatchByAttrs(), that reduced >> the code to 33 lines, and then with the change you suggested below, it's >> further reduced to 29 lines. I'll put them in Files for now. >> >> Files is too big, ~180K, that compares with the next one ~45K among files >> in the nio/file package. Unfortunately, the great majority of them are >> specs, probably 90%. >>> >>> 106 if (nRead1 == 0 || nRead2 == 0) { >>> 107 if (nRead1 == 0 && nRead2 == 0) { >>> 108 // both files reach EOF >>> 109 return -1; >>> 110 } else { >>> 111 // one of the files reaches EOF >>> 112 return totalRead; >>> 113 } >>> 114 } else if (nRead1 != nRead2) { >>> >>> I think it reads slightly better if the nested 'if' at lines 107-113 were >>> flattened into the else-if chain, like so: >>> >>> if (nRead1 == 0 && nRead2 == 0) { >>> // both files reached EOF >>> return -1; >>> } else if (nRead1 == 0 || nRead2 == 0) { >>> // one but not both files reached EOF >>> return totalRead; >>> } else if (nRead1 != nRead2) { >>> >>> There are a couple places where lines can be joined. The resulting lines >>> are longer than 80 characters but are still less than 100 characters, which >>> I think is OK. (I don't think we have a strict limit of 80 anymore.) >>> >>> 97 private static long mismatch(InputStream in1, InputStream in2) >>> 98 throws IOException { >>> >>> 117 return totalRead + >>> 118 Arrays.mismatch(buffer1, 0, len, buffer2, 0, len); >> >> Done. 95 characters :-) >> >>> >>> Comments on tests in Comparison.java: >>> >>> * This file, and several names within this file, should probably should be >>> renamed to focus on mismatch instead of comparison. >> >> Will rename to "mismatch". Previously, it covers both mismatch and compare. >>> >>> * I'm finding it quite difficult to read the test cases. >>> >>> The test files are placed into a Path[] and referenced by index. This makes >>> it quite difficult to determine, for example, if all six cases within the >>> mismatch loop are covered. >>> >>> Instead of storing the files at specific array indexes, I'd suggest doing >>> the following. >>> >>> 1) Establish good naming conventions for test files, including size >>> information, covering duplicate files (e.g., 'a' and 'b'), and a >>> modification at a particular offset. For example, you might do something >>> like the following: >>> >>> test120a - a test file with size 120 >>> test120b - a test file with size 120, identical to test120a >>> test120m110 - a test file with size 120, with an altered byte at 110 >>> etc. >> >> I can do that. I had hoped the Assertion msg, e.g. "Both files are empty, no >> mismatch" would help. Along with renaming variables, I will also add to the >> Assertion Msg with test case number, e.g. "Case#1: both files are empty, >> return no mismatch". >>> >>> 2) Declare fields of the same name containing the file paths. >>> >>> 3) Add some logic in prepareTestFile() to append each file's path to a list >>> after the file is created. Then have the @AfterClass method run through >>> this list and delete the files. >> >> May also create the test files in a folder and delete it @AfterClass. >>> >>> 4) The test data generator method can then have lines of test data that >>> reference the test files using a meaningful name, so it's easy to >>> understand the names and the expected results. Also, the test data >>> generator should have a comment that describes the test method parameters. >> >> Sure, will add comments and etc. >>> >>> * The generation of the test data files seems overly complicated. >>> >>> A StringBuilder is used to accumulate data, then it's turned into a String, >>> then (optionally) a character is replaced by the insertChar() method, and >>> then the string is written to the file. >>> >>> I think you can just accumulate the test data into a StringBuilder, >>> (optionally) call sb.setCharAt() to modify the data, and then write the SB >>> to the file. After all, Files.writeString() takes a CharSequence. >>> >>> The generateFixedString() method should be changed to create and return a >>> fresh StringBuilder with the desired contents. (See above.) It can also be >>> simplified quite a bit. >> >> Wouldn't it be nice if String has an utility method that generates random >> ASCII strings? ;-) It can be faster with its internal byte array. >>> >>> I don't think you need to have methods with Charset parameters. It's >>> reasonable to have the test files be text, which makes things easier to >>> develop and debug. But the test depends on characters being written in a >>> single-byte charset, because mismatch() reports *byte* positions and not >>> *character* positions. >>> >>> Using the method overloads that are implicitly UTF-8 is incorrect. If a >>> non-single-byte character were to slip into the test data (seems unlikely, >>> but possible), it'll be expanded to multiple bytes by the UTF-8 encoder, >>> which may throw off the result. Instead I'd suggest hardwiring the charset >>> to be US_ASCII to enforce the one-byte-per-character invariant. >>> >>> This is a lot of stuff. If you can't follow it all, perhaps we can work >>> together directly and go over the test file. >> >> I'll improve the test, and US_ASCII it is! I'll probably use a string of >> fixed length to generate the longest string the test needs, and then get >> substrings for other test cases, that may simplify the process. >> >> Thanks, >> Joe >> >>> >>> Thanks, >>> >>> s'marks