Hi Joe
Two comments:
1. How about (path1, path2)? I take a look at other similar APIs, some use
(c1,c2) and some (a,b).
2. Could the method be non-reflexive even if fs is non static? isSameFile(f,f)
is always true.
+ * <p> This method may not be atomic with respect to other file system
+ * operations. If the file system and files remain static then this method
+ * is <i>reflexive</i> (for {@code Path} {@code f}, {@code mismatch(f,f)}
+ * returns {@code -1L})
Thanks
Max
> On Oct 13, 2018, at 3:16 AM, Joe Wang <[email protected]> wrote:
>
> Hi all,
>
> Here's an update based on all of the great reviews and comments (thanks all!):
>
> JBS: https://bugs.openjdk.java.net/browse/JDK-8202285
> CSR: https://bugs.openjdk.java.net/browse/JDK-8202302
>
> Current version:
> specdiff:
> http://cr.openjdk.java.net/~joehw/jdk12/8202285/specdiff_02/java/nio/file/Files.html
> webrevs: http://cr.openjdk.java.net/~joehw/jdk12/8202285/webrev_v02/
>
> Previous version:
> specdiff:
> http://cr.openjdk.java.net/~joehw/jdk12/8202285/specdiff_v01/java/nio/file/Files.html
> webrevs: http://cr.openjdk.java.net/~joehw/jdk12/8202285/webrev_v01/
>
> It's been a while, so here's a summary of what's changed:
>
> Spec: Alan's patch to fix inconsistencies in wording/terminology
> Impl: s/mismatchByAttrs and etc/isSameFile;
> Stick with the simple implementation (InputStream/Buffer) for now.
> Further improvement may be done if there's a need;
> The impl is smaller than the previous version, merged the code into
> Files, removed FilesMismatch.java;
>
> Test: more readable variables instead of the array of test files
> more comments on the testcases
> improved the private methods used for generating the test files
>
> Thanks,
> Joe
>
> On 9/21/18, 6:49 PM, Joe Wang wrote:
>>
>> On 9/20/18, 2:46 PM, Stuart Marks wrote:
>>>
>>>
>>> On 9/19/18 11:48 AM, Joe Wang wrote:
>>>> After much discussion and 10 iterations of reviews, this proposal has
>>>> evolved from what was the original isSameContent method to a mismatch
>>>> method. API-wise, a compare method was also considered as it looked like
>>>> just a short step forward from mismatch, however, it was eventually
>>>> dropped since there is no convincing use case comparing files
>>>> lexicographically by contents. Impl-wise, extensive performance
>>>> benchmarking has been done to compare a buffered reading vs memory
>>>> mapping, the result was that a simple buffered reading performed better
>>>> among small files, and those with the mismatched byte closer to the
>>>> beginning of files. Since the proposed method's targeted files are small
>>>> ones, the impl currently does a buffered reading only.
>>>
>>> Hi Joe,
>>>
>>> Thanks for being persistent with this one!
>>
>> Thanks for the help, and the "lot of stuff" with clear indicators (e.g. line
>> numbers and etc.)! From JDK 11 to 12, hopefully wont be deferred to 13 :-)
>>>
>>> A very small spec nitpick:
>>>
>>>> specdiff:
>>>> http://cr.openjdk.java.net/~joehw/jdk12/8202285/specdiff/java/nio/file/Files.html
>>>
>>> 1544 * <li> {@link #isSameFile(Path, Path) isSameFile(path, path2)}
>>> returns true,</li>
>>>
>>> I would add "or" after the trailing comma. This makes it similar to the
>>> two-item list that follows.
>>
>> Done.
>>>
>>>> webrev: http://cr.openjdk.java.net/~joehw/jdk12/8202285/webrev/
>>>
>>> A couple minor comments on FilesMismatch.java:
>>>
>>> If mismatchByAttrs() is replaced with a call to isSameFile(), as Alan
>>> suggested, this simplifies things considerably. It looks like the
>>> mismatch() implementation will be reduced to around ~40 lines of code. Does
>>> it deserve its own file, or can it be placed directly into Files.java? That
>>> file has over 4,000 lines already though.
>>
>> I merged the private methods after removing mismatchByAttrs(), that reduced
>> the code to 33 lines, and then with the change you suggested below, it's
>> further reduced to 29 lines. I'll put them in Files for now.
>>
>> Files is too big, ~180K, that compares with the next one ~45K among files
>> in the nio/file package. Unfortunately, the great majority of them are
>> specs, probably 90%.
>>>
>>> 106 if (nRead1 == 0 || nRead2 == 0) {
>>> 107 if (nRead1 == 0 && nRead2 == 0) {
>>> 108 // both files reach EOF
>>> 109 return -1;
>>> 110 } else {
>>> 111 // one of the files reaches EOF
>>> 112 return totalRead;
>>> 113 }
>>> 114 } else if (nRead1 != nRead2) {
>>>
>>> I think it reads slightly better if the nested 'if' at lines 107-113 were
>>> flattened into the else-if chain, like so:
>>>
>>> if (nRead1 == 0 && nRead2 == 0) {
>>> // both files reached EOF
>>> return -1;
>>> } else if (nRead1 == 0 || nRead2 == 0) {
>>> // one but not both files reached EOF
>>> return totalRead;
>>> } else if (nRead1 != nRead2) {
>>>
>>> There are a couple places where lines can be joined. The resulting lines
>>> are longer than 80 characters but are still less than 100 characters, which
>>> I think is OK. (I don't think we have a strict limit of 80 anymore.)
>>>
>>> 97 private static long mismatch(InputStream in1, InputStream in2)
>>> 98 throws IOException {
>>>
>>> 117 return totalRead +
>>> 118 Arrays.mismatch(buffer1, 0, len, buffer2, 0, len);
>>
>> Done. 95 characters :-)
>>
>>>
>>> Comments on tests in Comparison.java:
>>>
>>> * This file, and several names within this file, should probably should be
>>> renamed to focus on mismatch instead of comparison.
>>
>> Will rename to "mismatch". Previously, it covers both mismatch and compare.
>>>
>>> * I'm finding it quite difficult to read the test cases.
>>>
>>> The test files are placed into a Path[] and referenced by index. This makes
>>> it quite difficult to determine, for example, if all six cases within the
>>> mismatch loop are covered.
>>>
>>> Instead of storing the files at specific array indexes, I'd suggest doing
>>> the following.
>>>
>>> 1) Establish good naming conventions for test files, including size
>>> information, covering duplicate files (e.g., 'a' and 'b'), and a
>>> modification at a particular offset. For example, you might do something
>>> like the following:
>>>
>>> test120a - a test file with size 120
>>> test120b - a test file with size 120, identical to test120a
>>> test120m110 - a test file with size 120, with an altered byte at 110
>>> etc.
>>
>> I can do that. I had hoped the Assertion msg, e.g. "Both files are empty, no
>> mismatch" would help. Along with renaming variables, I will also add to the
>> Assertion Msg with test case number, e.g. "Case#1: both files are empty,
>> return no mismatch".
>>>
>>> 2) Declare fields of the same name containing the file paths.
>>>
>>> 3) Add some logic in prepareTestFile() to append each file's path to a list
>>> after the file is created. Then have the @AfterClass method run through
>>> this list and delete the files.
>>
>> May also create the test files in a folder and delete it @AfterClass.
>>>
>>> 4) The test data generator method can then have lines of test data that
>>> reference the test files using a meaningful name, so it's easy to
>>> understand the names and the expected results. Also, the test data
>>> generator should have a comment that describes the test method parameters.
>>
>> Sure, will add comments and etc.
>>>
>>> * The generation of the test data files seems overly complicated.
>>>
>>> A StringBuilder is used to accumulate data, then it's turned into a String,
>>> then (optionally) a character is replaced by the insertChar() method, and
>>> then the string is written to the file.
>>>
>>> I think you can just accumulate the test data into a StringBuilder,
>>> (optionally) call sb.setCharAt() to modify the data, and then write the SB
>>> to the file. After all, Files.writeString() takes a CharSequence.
>>>
>>> The generateFixedString() method should be changed to create and return a
>>> fresh StringBuilder with the desired contents. (See above.) It can also be
>>> simplified quite a bit.
>>
>> Wouldn't it be nice if String has an utility method that generates random
>> ASCII strings? ;-) It can be faster with its internal byte array.
>>>
>>> I don't think you need to have methods with Charset parameters. It's
>>> reasonable to have the test files be text, which makes things easier to
>>> develop and debug. But the test depends on characters being written in a
>>> single-byte charset, because mismatch() reports *byte* positions and not
>>> *character* positions.
>>>
>>> Using the method overloads that are implicitly UTF-8 is incorrect. If a
>>> non-single-byte character were to slip into the test data (seems unlikely,
>>> but possible), it'll be expanded to multiple bytes by the UTF-8 encoder,
>>> which may throw off the result. Instead I'd suggest hardwiring the charset
>>> to be US_ASCII to enforce the one-byte-per-character invariant.
>>>
>>> This is a lot of stuff. If you can't follow it all, perhaps we can work
>>> together directly and go over the test file.
>>
>> I'll improve the test, and US_ASCII it is! I'll probably use a string of
>> fixed length to generate the longest string the test needs, and then get
>> substrings for other test cases, that may simplify the process.
>>
>> Thanks,
>> Joe
>>
>>>
>>> Thanks,
>>>
>>> s'marks