Re: Hashing files/bytes Re: RFR(JDK11/NIO) 8202285: (fs) Add a method to Files for comparing file contents

2018-05-02 Thread forax
- Original Message -
> From: "John Rose"
> To: "Remi Forax"
> Cc: "Paul Sandoz", "nio-dev", "core-libs-dev"
> Sent: Wednesday, May 2, 2018 07:35:38
> Subject: Re: Hashing files/bytes  Re: RFR(JDK11/NIO) 8202285: (fs) Add a
> method to Files for comparing file contents

> Here's another potential stacking:
> 
> Define an interface ByteSequence, similar to CharSequence,
> as a zero-copy reference to some stored bytes somewhere.
> (Give it a long length.)  Define bulk methods on it like hash
> and mismatch and transferTo.  Then make File and ByteBuffer
> implement it.  Deal with the cross-product of source and
> destination types underneath the interface.
> 
> (Also I want ByteSequence as a way to encapsulate resource
> data for class files and condy, using zero-copy methods.
> The types byte[] and String don't scale and require copies.)

Your ByteSequence is ByteBuffer!
A ByteBuffer can be a mapped file or can wrap a byte array;
mismatch is compareTo, transferTo is put(ByteBuffer), and hash should be
messageDigest.digest(ByteBuffer), which doesn't exist but should.
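
A minimal sketch of the point above, assuming SHA-256 rather than the SHA1 used elsewhere in the thread. MessageDigest has no digest(ByteBuffer), but update(ByteBuffer) does exist, so one helper can hash either a wrapped byte array or a mapped file. The class name BufferDigests and its methods are hypothetical, not JDK API.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical helper (not a JDK API): hashes whatever bytes a ByteBuffer exposes. */
final class BufferDigests {

    /** Digests the remaining bytes of a buffer without moving the caller's position. */
    static byte[] sha256(ByteBuffer bytes) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            // duplicate() shares the content but has its own position and limit
            digest.update(bytes.duplicate());
            return digest.digest();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 not available", e);
        }
    }

    /** Maps a file read-only and digests it; a single mapping is limited to 2 GiB. */
    static byte[] sha256(Path file) throws IOException {
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            return sha256(channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size()));
        }
    }
}
```

Calling BufferDigests.sha256(ByteBuffer.wrap(bytes)) and BufferDigests.sha256(path) goes through the same update(ByteBuffer) call, which is the sense in which ByteBuffer already acts as the common abstraction.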

> 
> — John

Rémi

> 
> On May 1, 2018, at 3:04 PM, fo...@univ-mlv.fr wrote:
>> 
>> - Original Message -
>>> From: "Paul Sandoz"
>>> To: "Remi Forax"
>>> Cc: "Alan Bateman", "nio-dev", "core-libs-dev"
>>> Sent: Tuesday, May 1, 2018 00:37:57
>>> Subject: Hashing files/bytes  Re: RFR(JDK11/NIO) 8202285: (fs) Add a
>>> method to Files for comparing file contents
>> 
>>> Thanks, better than I expected with the transferTo method we recently
>>> added, but I think we could do even better for the ease-of-use case of
>>> “give me the hash of this file's contents or these bytes or this byte
>>> buffer”.
>> 
>> Yes, it can be a nice addition to java.nio.file.Files, and in that case the
>> documentation of the method that compares content can refer to this new
>> method.
>> 
>>> 
>>> Paul.
>> 
>> Rémi
>> 
>>> 
>>>> On Apr 30, 2018, at 3:23 PM, Remi Forax wrote:
>>>> 
>>>>> To Remi’s point this might dissuade/guide developers from using this
>>>>> method when there are other more efficient techniques available when
>>>>> operating at larger scales. However, it is unfortunately harder than it
>>>>> should be in Java to hash the contents of a file, a byte[] or ByteBuffer,
>>>>> according to some chosen algorithm (or a good default).
>>>> 
>>>> it's 6 lines of code
>>>> 
>>>> var digest = MessageDigest.getInstance("SHA1");
>>>> try (var input = Files.newInputStream(Path.of("myfile.txt"));
>>>>      var output = new DigestOutputStream(OutputStream.nullOutputStream(), digest)) {
>>>>   input.transferTo(output);
>>>> }
>>>> var hash = digest.digest();
>>>> 
>>>> or 3 lines if you don't mind loading the whole file in memory
>>>> 
>>>> var digest = MessageDigest.getInstance("SHA1");
>>>> digest.update(Files.readAllBytes(Path.of("myfile.txt")));
>>>> var hash = digest.digest();
>>>> 
>>>>> Paul.
>>>> 
>>>> Rémi


Re: Hashing files/bytes Re: RFR(JDK11/NIO) 8202285: (fs) Add a method to Files for comparing file contents

2018-05-01 Thread John Rose
Here's another potential stacking:

Define an interface ByteSequence, similar to CharSequence,
as a zero-copy reference to some stored bytes somewhere.
(Give it a long length.)  Define bulk methods on it like hash
and mismatch and transferTo.  Then make File and ByteBuffer
implement it.  Deal with the cross-product of source and
destination types underneath the interface.

(Also I want ByteSequence as a way to encapsulate resource
data for class files and condy, using zero-copy methods.
The types byte[] and String don't scale and require copies.)

— John

On May 1, 2018, at 3:04 PM, fo...@univ-mlv.fr wrote:
> 
> - Original Message -
>> From: "Paul Sandoz"
>> To: "Remi Forax"
>> Cc: "Alan Bateman", "nio-dev", "core-libs-dev"
>> Sent: Tuesday, May 1, 2018 00:37:57
>> Subject: Hashing files/bytes  Re: RFR(JDK11/NIO) 8202285: (fs) Add a
>> method to Files for comparing file contents
> 
>> Thanks, better than I expected with the transferTo method we recently added,
>> but I think we could do even better for the ease-of-use case of “give me the
>> hash of this file's contents or these bytes or this byte buffer”.
> 
> Yes, it can be a nice addition to java.nio.file.Files, and in that case the
> documentation of the method that compares content can refer to this new
> method.
> 
>> 
>> Paul.
> 
> Rémi
> 
>> 
>>> On Apr 30, 2018, at 3:23 PM, Remi Forax  wrote:
>>> 
 
>>>> To Remi’s point this might dissuade/guide developers from using this
>>>> method when there are other more efficient techniques available when
>>>> operating at larger scales. However, it is unfortunately harder than it
>>>> should be in Java to hash the contents of a file, a byte[] or ByteBuffer,
>>>> according to some chosen algorithm (or a good default).
>>> 
>>> it's 6 lines of code
>>> 
>>> var digest = MessageDigest.getInstance("SHA1");
>>> try (var input = Files.newInputStream(Path.of("myfile.txt"));
>>>      var output = new DigestOutputStream(OutputStream.nullOutputStream(), digest)) {
>>>   input.transferTo(output);
>>> }
>>> var hash = digest.digest();
>>> 
>>> or 3 lines if you don't mind to load the whole file in memory
>>> 
>>> var digest = MessageDigest.getInstance("SHA1");
>>> digest.update(Files.readAllBytes(Path.of("myfile.txt")));
>>> var hash = digest.digest();
>>> 
 
>>>> Paul.
>>> 
>>> Rémi



Re: Hashing files/bytes Re: RFR(JDK11/NIO) 8202285: (fs) Add a method to Files for comparing file contents

2018-05-01 Thread forax
- Original Message -
> From: "Paul Sandoz"
> To: "Remi Forax"
> Cc: "Alan Bateman", "nio-dev", "core-libs-dev"
> Sent: Tuesday, May 1, 2018 00:37:57
> Subject: Hashing files/bytes  Re: RFR(JDK11/NIO) 8202285: (fs) Add a
> method to Files for comparing file contents

> Thanks, better than I expected with the transferTo method we recently added,
> but I think we could do even better for the ease-of-use case of “give me the
> hash of this file's contents or these bytes or this byte buffer”.

Yes, it can be a nice addition to java.nio.file.Files, and in that case the
documentation of the method that compares content can refer to this new
method.

> 
> Paul.

Rémi

> 
>> On Apr 30, 2018, at 3:23 PM, Remi Forax  wrote:
>> 
>>> 
>>> To Remi’s point this might dissuade/guide developers from using this method
>>> when there are other more efficient techniques available when operating at
>>> larger scales. However, it is unfortunately harder than it should be in Java
>>> to hash the contents of a file, a byte[] or ByteBuffer, according to some
>>> chosen algorithm (or a good default).
>> 
>> it's 6 lines of code
>> 
>>  var digest = MessageDigest.getInstance("SHA1");
>>  try (var input = Files.newInputStream(Path.of("myfile.txt"));
>>       var output = new DigestOutputStream(OutputStream.nullOutputStream(), digest)) {
>>    input.transferTo(output);
>>  }
>>  var hash = digest.digest();
>> 
>> or 3 lines if you don't mind to load the whole file in memory
>> 
>>  var digest = MessageDigest.getInstance("SHA1");
>>  digest.update(Files.readAllBytes(Path.of("myfile.txt")));
>>  var hash = digest.digest();
>> 
>>> 
>>> Paul.
>> 
>> Rémi


Re: RFR(JDK11/NIO) 8202285: (fs) Add a method to Files for comparing file contents

2018-05-01 Thread Joe Wang

Thanks John for the background and detailed information.

-Joe

On 4/30/2018 6:18 PM, John Rose wrote:
On Apr 30, 2018, at 4:47 PM, Joe Wang wrote:


Are there real-life use cases? It may be useful for example to check 
if the files have the same header.


After equality comparison, lexical comparison is a key use case.
By allowing the user to interpret the data around the mismatch,
the comparison can be made sensitive to things like locales.

As Paul implies, finding a mismatch is the correct operation to build
equality checks on top of, because (a) a mismatch has to be detected
anyway to prove inequality, and (b) giving the location of the mismatch,
instead of throwing it away, unlocks a variety of other operations.

If you want real-life use cases, look at uses of /usr/bin/cmp in Unix
shell scripts.  The cmp command is to Unix files what Paul's array
mismatch methods are to Java arrays.  Here's a man page reference:

https://docs.oracle.com/cd/E19683-01/816-0210/6m6nb7m6c/index.html

As with the array mismatch methods, the cmp command allows the
user to specify optional offsets within each file to start comparing, as
well as an optional length to stop comparing after.

See the file BufferMismatch.java for the (partial) application of these
ideas to NIO buffers.

I suppose the Java-flavored version of "cmp - file" would be a file
comparator which would take a byte buffer as a second operand,
and return an indication of the location of the mismatch.  Note that
"cmp - file" compares a computed stream against a stored file.

I think Paul and I have sketched a natural "sweet spot" for performing
bitwise comparisons on stored data.  It's up to you how much to implement.
I suggest that, if you don't feel inspired to do it all in one go, that you
leave room in the code for future expansions (maybe as with
BufferMismatch), and perhaps file a follow-up RFE.

— John





Re: RFR(JDK11/NIO) 8202285: (fs) Add a method to Files for comparing file contents

2018-04-30 Thread Joe Wang
I see. Generalization vs solution in a specific scope, it's kind of a 
balancing art indeed :-)


-Joe

On 4/30/2018 5:13 PM, Paul Sandoz wrote:



On Apr 30, 2018, at 4:47 PM, Joe Wang  wrote:

—

It’s tempting (well to me at least) to generalize to a mismatch method (like 
for arrays) returning the mismatching location in bytes, then you can determine 
if one file is a prefix of another given the files sizes. Bound accepting 
methods would also be useful to mismatch on partial content (including within 
the same file). If you use memory mapped files we can use direct byte buffers 
to efficiently perform the mismatch.

Are there real-life use cases?  It may be useful for example to check if the 
files have the same header.


Yes, something like that. I was just searching for a more general abstraction 
e.g. mismatch, that can support equality and lexicographical comparison of file 
contents. Other use-cases tend pop out almost for free because of that :-) 
However, its possible to support the more advanced cases directly with mapped 
byte buffers.

The good news is you can add isSameContent and if there is demand for mismatch 
add that, deriving the implementation of isSameContent from the new method.

Paul.


We did a bit of use-case study where we compared a bunch of possible options, 
including read string with bound, or by specifying patterns, and/or read into a list 
with a regex/pattern as separator (vs the default line-separator). We concluded that 
readString is a popular demand, and it's usually a quick read of small files, e.g. a 
config file, a SQL query file and etc. The methods fulfill the process of String 
<==> File transformation, a straight and quick way of converting a String to 
File and vice versa.

The demand for isSameContent isn't necessarily as popular as readString, but there 
were still some real use cases where people asked how to do it quickly. When we have 
String <==> File, it's natural to at least have a comparison method since 
String.equal is essential to it. Plus, we already had isSameFile.

Best,
Joe


To Remi’s point this might dissuade/guide developers from using this method 
when there are other more efficient techniques available when operating at 
larger scales. However, it is unfortunately harder that it should be in Java to 
hash the contents of a file, a byte[] or ByteBuffer, according to some chosen 
algorithm (or a good default).

Paul.




Re: RFR(JDK11/NIO) 8202285: (fs) Add a method to Files for comparing file contents

2018-04-30 Thread John Rose
On Apr 30, 2018, at 4:47 PM, Joe Wang  wrote:
> 
> Are there real-life use cases?  It may be useful for example to check if the 
> files have the same header.

After equality comparison, lexical comparison is a key use case.
By allowing the user to interpret the data around the mismatch,
the comparison can be made sensitive to things like locales.

As Paul implies, finding a mismatch is the correct operation to build
equality checks on top of, because (a) a mismatch has to be detected
anyway to prove inequality, and (b) giving the location of the mismatch,
instead of throwing it away, unlocks a variety of other operations.
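
As a rough illustration of "mismatch first, equality on top", the sketch below reads two files in lock-step chunks and reports the offset of the first differing byte; an equality check then falls out as a one-liner. The class FileMismatch and its method names are hypothetical; it relies on InputStream.readNBytes and the range overload of Arrays.mismatch, both available from Java 9. (Java 12 and later ship Files.mismatch(Path, Path), which would normally be preferred.)

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

/** Hypothetical helper illustrating an equality check derived from a mismatch search. */
final class FileMismatch {
    private static final int CHUNK = 8192;

    /**
     * Returns the offset of the first differing byte, or -1 if the contents are
     * identical. If one file is a proper prefix of the other, the length of the
     * shorter file is returned.
     */
    static long mismatch(Path a, Path b) throws IOException {
        try (InputStream inA = Files.newInputStream(a);
             InputStream inB = Files.newInputStream(b)) {
            byte[] bufA = new byte[CHUNK];
            byte[] bufB = new byte[CHUNK];
            long offset = 0;
            while (true) {
                // readNBytes only returns fewer than CHUNK bytes at end of stream
                int nA = inA.readNBytes(bufA, 0, CHUNK);
                int nB = inB.readNBytes(bufB, 0, CHUNK);
                int i = Arrays.mismatch(bufA, 0, nA, bufB, 0, nB);
                if (i >= 0) {
                    return offset + i;
                }
                if (nA < CHUNK) {
                    return -1; // chunks matched and both streams are exhausted
                }
                offset += nA;
            }
        }
    }

    /** Equality check built on top of mismatch. */
    static boolean sameContent(Path a, Path b) throws IOException {
        return mismatch(a, b) == -1;
    }
}
```

Because the position is returned rather than discarded, a caller can also test whether one file is a prefix of the other by comparing the result with the file sizes.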

If you want real-life use cases, look at uses of /usr/bin/cmp in Unix
shell scripts.  The cmp command is to Unix files what Paul's array
mismatch methods are to Java arrays.  Here's a man page reference:

https://docs.oracle.com/cd/E19683-01/816-0210/6m6nb7m6c/index.html

As with the array mismatch methods, the cmp command allows the
user to specify optional offsets within each file to start comparing, as
well as an optional length to stop comparing after.
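
For in-memory data, the cmp-style skip offsets and length limit map fairly directly onto the range overload of Arrays.mismatch (Java 9). The wrapper below is only a sketch; BoundedMismatch and its parameter names are made up for illustration.

```java
import java.util.Arrays;

/** Hypothetical cmp-like helper over byte arrays. */
final class BoundedMismatch {

    /**
     * Compares at most {@code length} bytes of {@code a} starting at {@code skipA}
     * with the bytes of {@code b} starting at {@code skipB}. Returns the index of
     * the first mismatch relative to the start offsets, or -1 if the compared
     * ranges are equal. Throws IllegalArgumentException if a skip offset lies
     * beyond the end of its array.
     */
    static int mismatch(byte[] a, int skipA, byte[] b, int skipB, int length) {
        // clamp the end of each range to the array length, avoiding int overflow
        int endA = (int) Math.min(a.length, (long) skipA + length);
        int endB = (int) Math.min(b.length, (long) skipB + length);
        return Arrays.mismatch(a, skipA, endA, b, skipB, endB);
    }
}
```

A "same header" check of the kind mentioned earlier in the thread would then be mismatch(a, 0, b, 0, headerLength) == -1.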

See the file BufferMismatch.java for the (partial) application of these
ideas to NIO buffers.

I suppose the Java-flavored version of "cmp - file" would be a file
comparator which would take a byte buffer as a second operand,
and return an indication of the location of the mismatch.  Note that
"cmp - file" compares a computed stream against a stored file.

I think Paul and I have sketched a natural "sweet spot" for performing
bitwise comparisons on stored data.  It's up to you how much to implement.
I suggest that, if you don't feel inspired to do it all in one go, that you
leave room in the code for future expansions (maybe as with
BufferMismatch), and perhaps file a follow-up RFE.

— John



Re: RFR(JDK11/NIO) 8202285: (fs) Add a method to Files for comparing file contents

2018-04-30 Thread Paul Sandoz


> On Apr 30, 2018, at 4:47 PM, Joe Wang  wrote:
>> 
>> —
>> 
>> It’s tempting (well, to me at least) to generalize to a mismatch method (like 
>> for arrays) returning the mismatching location in bytes; then you can 
>> determine if one file is a prefix of another given the file sizes. 
>> Bounds-accepting methods would also be useful to mismatch on partial content 
>> (including within the same file). If you use memory-mapped files we can use 
>> direct byte buffers to efficiently perform the mismatch.
> 
> Are there real-life use cases?  It may be useful for example to check if the 
> files have the same header.
> 

Yes, something like that. I was just searching for a more general abstraction, 
e.g. mismatch, that can support equality and lexicographical comparison of file 
contents. Other use cases tend to pop out almost for free because of that :-) 
However, it's possible to support the more advanced cases directly with mapped 
byte buffers.
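
A sketch of the mapped-buffer route, assuming Java 11 or later for ByteBuffer.mismatch and files small enough for a single mapping (under 2 GiB). The class MappedMismatch and its methods are hypothetical names used only for this example.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Hypothetical helper: compares two files through read-only memory mappings. */
final class MappedMismatch {

    /** Returns the offset of the first mismatch, or -1 if the contents are identical. */
    static int mismatch(Path a, Path b) throws IOException {
        try (FileChannel chA = FileChannel.open(a, StandardOpenOption.READ);
             FileChannel chB = FileChannel.open(b, StandardOpenOption.READ)) {
            ByteBuffer bufA = chA.map(FileChannel.MapMode.READ_ONLY, 0, chA.size());
            ByteBuffer bufB = chB.map(FileChannel.MapMode.READ_ONLY, 0, chB.size());
            // If one buffer is a proper prefix of the other, the shorter length is returned
            return bufA.mismatch(bufB);
        }
    }

    /** True if the content of {@code prefix} is a leading part of {@code whole}. */
    static boolean isPrefix(Path prefix, Path whole) throws IOException {
        long m = mismatch(prefix, whole);
        return m == -1 || m == Files.size(prefix);
    }
}
```

Comparing partial content, including two regions of the same file, would only require different position and size arguments to map.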

The good news is you can add isSameContent and if there is demand for mismatch 
add that, deriving the implementation of isSameContent from the new method.

Paul.

> We did a bit of use-case study where we compared a bunch of possible options, 
> including read string with bound, or by specifying patterns, and/or read into 
> a list with a regex/pattern as separator (vs the default line-separator). We 
> concluded that readString is a popular demand, and it's usually a quick read 
> of small files, e.g. a config file, a SQL query file and etc. The methods 
> fulfill the process of String <==> File transformation, a straight and quick 
> way of converting a String to File and vice versa.
> 
> The demand for isSameContent isn't necessarily as popular as readString, but 
> there were still some real use cases where people asked how to do it quickly. 
> When we have String <==> File, it's natural to at least have a comparison 
> method since String.equal is essential to it. Plus, we already had isSameFile.
> 
> Best,
> Joe
> 
>> 
>> To Remi’s point this might dissuade/guide developers from using this method 
>> when there are other more efficient techniques available when operating at 
>> larger scales. However, it is unfortunately harder that it should be in Java 
>> to hash the contents of a file, a byte[] or ByteBuffer, according to some 
>> chosen algorithm (or a good default).
>> 
>> Paul.
> 



Re: RFR(JDK11/NIO) 8202285: (fs) Add a method to Files for comparing file contents

2018-04-30 Thread Joe Wang
Good point. So even if we go with "has" (instead of "is"), it'd be 
hasSameContent, semantically, it would be "has the same content".


-Joe

On 4/30/2018 4:10 PM, Jonathan Gibbons wrote:
At the risk of triggering a #bikeshed on the relative merits of 
"content" vs. "contents", I note that String has put a stake in the 
ground for the singular form, with contentEquals.


https://docs.oracle.com/javase/9/docs/api/java/lang/String.html#contentEquals-java.lang.CharSequence- 



-- Jon


On 4/30/18 4:02 PM, Joe Wang wrote:

Hi Jonathan,

hasSameContents does read better in English. This one was made 
isSameContent since I thought we'd want to stack it next to the 
existing isSameFile method since it's meant to be an extend to that 
method. I'd love to hear what people think about this. I'm open to 
change the name if there's a good consensus.


Cheers,
Joe

On 4/27/2018 4:01 AM, Jonathan Bluett-Duncan wrote:

Hi Joe,

I wonder if the method `isSameContent` should be named 
`haveSameContents` so as to read more fluently in English.


Cheers,
Jonathan

On 27 April 2018 at 11:58, Daniel Fuchs wrote:


    Hi Joe,

    On the specification side, I think I would reword the API
    documentation to first explain how the method checks the
    content of the two files.

    The fact that it doesn't check the actual content if
    the two files are 'the same' is kind of an optimization.

    So I would suggest to invert the order of the two paragraph
    in the documentation, and combine them into one - something like:

         * This method first calls {@link #isSameFile(java.nio.file.Path,
         * java.nio.file.Path) isSameFile(path, path2)} to determine whether
         * the two files are the same. If {@code isSameFile(path, path2)}
         * returns false, this method will proceed to read the files and
         * compare them byte by byte to determine if they contain the same
         * contents. Otherwise, this method will return true without further
         * processing.


    On the implementation side I don't think it's reasonable to call
    readAllBytes() and hold the content of the two files in memory
    for comparing their content, especially if it's to discover that
    the first byte differs.

    Some lock-step reading of the two files would seem more appropriate.


    best regards,

    -- daniel





    On 27/04/2018 05:51, Joe Wang wrote:

    Hi,

    Considering extending isSameFile to add isSameContent to
    Files. Please review.

    JBS: https://bugs.openjdk.java.net/browse/JDK-8202285


    webrev:
http://cr.openjdk.java.net/~joehw/jdk11/8202285/webrev/


    specdiff:
http://cr.openjdk.java.net/~joehw/jdk11/8202285/specdiff/java/nio/file/Files.html 

 




    Thanks,
    Joe











Re: RFR(JDK11/NIO) 8202285: (fs) Add a method to Files for comparing file contents

2018-04-30 Thread Joe Wang


On 4/30/2018 11:47 AM, Paul Sandoz wrote:



On Apr 27, 2018, at 4:30 AM, Alan Bateman  wrote:

On 27/04/2018 05:51, Joe Wang wrote:

Hi,

Considering extending isSameFile to add isSameContent to Files. Please review.

JBS: https://bugs.openjdk.java.net/browse/JDK-8202285

webrev: http://cr.openjdk.java.net/~joehw/jdk11/8202285/webrev/

specdiff: 
http://cr.openjdk.java.net/~joehw/jdk11/8202285/specdiff/java/nio/file/Files.html

I assume we should ignore the implementation for now as the eventual 
implementation won't use readAllBytes (at least not for large files).


Yes, as long as we don’t forget to follow up on a replacement (using memory 
mapped files say).


True, updated now :-)



The existing isSameFile is specified as "Tests if two paths locate the same file" and it 
would be good if the new method could be somewhat consistent with that, e.g. "Tests if the 
content of two files is identical".

Specifying that two paths that locate the same file always return true is 
reasonable. This could be made clearer by saying that the method always returns 
true when path and path2 are equal, even if the file does not exist.

The @return should say that it returns true if path and path2 locate the same 
file or the content of both files is identical.

The javadoc for SecurityException has "to the file", I assume this should be 
"to both files".


We might also want to say the contents of the two files are assumed to be held 
constant during the operation.


Added a statement.


—

It’s tempting (well to me at least) to generalize to a mismatch method (like 
for arrays) returning the mismatching location in bytes, then you can determine 
if one file is a prefix of another given the files sizes. Bound accepting 
methods would also be useful to mismatch on partial content (including within 
the same file). If you use memory mapped files we can use direct byte buffers 
to efficiently perform the mismatch.


Are there real-life use cases?  It may be useful for example to check if 
the files have the same header.


We did a bit of a use-case study where we compared a number of possible 
options, including reading a string with a bound, or by specifying patterns, 
and/or reading into a list with a regex/pattern as the separator (vs. the 
default line separator). We concluded that readString is in popular 
demand, and it's usually a quick read of small files, e.g. a config 
file, a SQL query file, etc. The methods fulfill the process of 
String <==> File transformation, a straight and quick way of converting 
a String to a File and vice versa.


The demand for isSameContent isn't necessarily as high as for readString, 
but there were still some real use cases where people asked how to do it 
quickly. When we have String <==> File, it's natural to at least have a 
comparison method since String.equals is essential to it. Plus, we 
already had isSameFile.


Best,
Joe



To Remi’s point this might dissuade/guide developers from using this method 
when there are other more efficient techniques available when operating at 
larger scales. However, it is unfortunately harder that it should be in Java to 
hash the contents of a file, a byte[] or ByteBuffer, according to some chosen 
algorithm (or a good default).

Paul.




Re: RFR(JDK11/NIO) 8202285: (fs) Add a method to Files for comparing file contents

2018-04-30 Thread Joe Wang
First, this is intended to be an extension to the existing isSameFile 
method, since that method stops short of comparing the content to answer 
the question of whether two files are equal.


We did a review and a bit of research on user demand. Demand for comparing 
files isn't as high as for, say, readString, but there's a fair number of 
people who were interested in determining if two files have the same content. 
It would be nice if you could point us to evidence that comparing a file 
against a batch of other files is the usual use case.


When comparing one file against many, hashing would be more efficient. Between 
two files, a byte-by-byte comparison would be the error-free choice (hashing 
carries a tiny chance of a collision).
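
To make the one-against-many case concrete, here is a sketch that hashes the reference file once and then compares digests, using the streaming DigestOutputStream idiom posted elsewhere in this thread (Java 11 for OutputStream.nullOutputStream). The class DuplicateFinder, its method names, and the choice of SHA-256 are assumptions made for the example.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.DigestOutputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: finds candidates whose digest matches a reference file. */
final class DuplicateFinder {

    /** Streams the file through a digest without holding it in memory. */
    static byte[] sha256(Path file) throws IOException {
        MessageDigest digest;
        try {
            digest = MessageDigest.getInstance("SHA-256");
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 not available", e);
        }
        try (InputStream in = Files.newInputStream(file);
             OutputStream out = new DigestOutputStream(OutputStream.nullOutputStream(), digest)) {
            in.transferTo(out);
        }
        return digest.digest();
    }

    /** Returns the candidates whose digest equals that of the reference file. */
    static List<Path> matching(Path reference, List<Path> candidates) throws IOException {
        byte[] expected = sha256(reference); // computed once, reused for every candidate
        List<Path> matches = new ArrayList<>();
        for (Path candidate : candidates) {
            if (MessageDigest.isEqual(expected, sha256(candidate))) {
                matches.add(candidate);
            }
        }
        return matches;
    }
}
```

Each candidate is still read once; the saving is that the reference file is not re-read per comparison, and digests can be stored for later runs. A match is only as trustworthy as the hash, so a byte-by-byte check can be run on the reported matches if certainty is required.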


Thanks,
Joe

On 4/27/2018 4:37 AM, Remi Forax wrote:

This seems to promote the wrong way to do such a thing:
the usual use case is that you want to compare the content of a well-known file 
with the content of a bunch of other files, so hashing is better.

Rémi

- Original Message -

From: "Joe Wang"
To: "nio-dev", "core-libs-dev"
Sent: Friday, April 27, 2018 06:51:08
Subject: RFR(JDK11/NIO) 8202285: (fs) Add a method to Files for comparing file 
contents

Hi,

Considering extending isSameFile to add isSameContent to Files. Please
review.

JBS: https://bugs.openjdk.java.net/browse/JDK-8202285

webrev: http://cr.openjdk.java.net/~joehw/jdk11/8202285/webrev/

specdiff:
http://cr.openjdk.java.net/~joehw/jdk11/8202285/specdiff/java/nio/file/Files.html

Thanks,
Joe




Re: RFR(JDK11/NIO) 8202285: (fs) Add a method to Files for comparing file contents

2018-04-30 Thread Jonathan Gibbons
At the risk of triggering a #bikeshed on the relative merits of 
"content" vs. "contents", I note that String has put a stake in the 
ground for the singular form, with contentEquals.


https://docs.oracle.com/javase/9/docs/api/java/lang/String.html#contentEquals-java.lang.CharSequence-

-- Jon


On 4/30/18 4:02 PM, Joe Wang wrote:

Hi Jonathan,

hasSameContents does read better in English. This one was made 
isSameContent since I thought we'd want to stack it next to the 
existing isSameFile method since it's meant to be an extend to that 
method. I'd love to hear what people think about this. I'm open to 
change the name if there's a good consensus.


Cheers,
Joe

On 4/27/2018 4:01 AM, Jonathan Bluett-Duncan wrote:

Hi Joe,

I wonder if the method `isSameContent` should be named 
`haveSameContents` so as to read more fluently in English.


Cheers,
Jonathan

On 27 April 2018 at 11:58, Daniel Fuchs wrote:


    Hi Joe,

    On the specification side, I think I would reword the API
    documentation to first explain how the method checks the
    content of the two files.

    The fact that it doesn't check the actual content if
    the two files are 'the same' is kind of an optimization.

    So I would suggest to invert the order of the two paragraph
    in the documentation, and combine them into one - something like:

         * This method first calls {@link #isSameFile(java.nio.file.Path,
         * java.nio.file.Path) isSameFile(path, path2)} to determine whether
         * the two files are the same. If {@code isSameFile(path, path2)}
         * returns false, this method will proceed to read the files and
         * compare them byte by byte to determine if they contain the same
         * contents. Otherwise, this method will return true without further
         * processing.


    On the implementation side I don't think it's reasonable to call
    readAllBytes() and hold the content of the two files in memory
    for comparing their content, especially if it's to discover that
    the first byte differs.

    Some lock-step reading of the two files would seem more appropriate.

    best regards,

    -- daniel





    On 27/04/2018 05:51, Joe Wang wrote:

    Hi,

    Considering extending isSameFile to add isSameContent to
    Files. Please review.

    JBS: https://bugs.openjdk.java.net/browse/JDK-8202285
    

    webrev:
    http://cr.openjdk.java.net/~joehw/jdk11/8202285/webrev/


    specdiff:
http://cr.openjdk.java.net/~joehw/jdk11/8202285/specdiff/java/nio/file/Files.html



    Thanks,
    Joe









Re: RFR(JDK11/NIO) 8202285: (fs) Add a method to Files for comparing file contents

2018-04-30 Thread Joe Wang



On 4/27/2018 4:30 AM, Alan Bateman wrote:

On 27/04/2018 05:51, Joe Wang wrote:

Hi,

Considering extending isSameFile to add isSameContent to Files. 
Please review.


JBS: https://bugs.openjdk.java.net/browse/JDK-8202285

webrev: http://cr.openjdk.java.net/~joehw/jdk11/8202285/webrev/

specdiff: 
http://cr.openjdk.java.net/~joehw/jdk11/8202285/specdiff/java/nio/file/Files.html
I assume we should ignore the implementation for now as the eventual 
implementation won't use readAllBytes (at least not for large files).


webrev was provided since sometimes it's helpful. But yeah, I've updated 
the impl.


The existing isSameFile is specified as "Tests if two paths locate the 
same file" and it would be good if the new method could be somewhat 
consistent with that, e.g. "Tests if the content of two files is 
identical".


Updated accordingly.


Specifying that two paths that locate the same file always return true 
is reasonable. This could be made clearer by saying that the method 
always returns true when path and path2 are equal, even if the 
file does not exist.


Modified with a couple of bullet points, added the above to the first.


The @return should say that it returns true if path and path2 locate 
the same file or the content of both files is identical.


Added.


The javadoc for SecurityException has "to the file", I assume this 
should be "to both files".


Fixed too.

Thanks,
Joe



-Alan





Re: RFR(JDK11/NIO) 8202285: (fs) Add a method to Files for comparing file contents

2018-04-30 Thread Joe Wang

Hi Jonathan,

hasSameContents does read better in English. This one was named 
isSameContent since I thought we'd want to stack it next to the existing 
isSameFile method, as it's meant to be an extension of that method. I'd 
love to hear what people think about this. I'm open to changing the name 
if there's a good consensus.


Cheers,
Joe

On 4/27/2018 4:01 AM, Jonathan Bluett-Duncan wrote:

Hi Joe,

I wonder if the method `isSameContent` should be named 
`haveSameContents` so as to read more fluently in English.


Cheers,
Jonathan

On 27 April 2018 at 11:58, Daniel Fuchs wrote:


Hi Joe,

On the specification side, I think I would reword the API
documentation to first explain how the method checks the
content of the two files.

The fact that it doesn't check the actual content if
the two files are 'the same' is kind of an optimization.

So I would suggest to invert the order of the two paragraph
in the documentation, and combine them into one - something like:

     * This method first calls {@link #isSameFile(java.nio.file.Path,
     * java.nio.file.Path) isSameFile(path, path2)} to determine whether
     * the two files are the same. If {@code isSameFile(path, path2)}
     * returns false, this method will proceed to read the files and
     * compare them byte by byte to determine if they contain the same
     * contents. Otherwise, this method will return true without further
     * processing.


On the implementation side I don't think it's reasonable to call
readAllBytes() and hold the content of the two files in memory
for comparing their content, especially if it's to discover that
the first byte differs.

Some lock-step reading of the two files would seem more appropriate.

best regards,

-- daniel





On 27/04/2018 05:51, Joe Wang wrote:

Hi,

Considering extending isSameFile to add isSameContent to
Files. Please review.

JBS: https://bugs.openjdk.java.net/browse/JDK-8202285


webrev:
http://cr.openjdk.java.net/~joehw/jdk11/8202285/webrev/


specdiff:

http://cr.openjdk.java.net/~joehw/jdk11/8202285/specdiff/java/nio/file/Files.html




Thanks,
Joe







Re: RFR(JDK11/NIO) 8202285: (fs) Add a method to Files for comparing file contents

2018-04-30 Thread Joe Wang

Hi Daniel,

Thanks for reviewing the proposal!

For the spec, or javadoc in general, the first sentence shall be a short 
summary of the method, a definition of what the method is. It appears in 
the method summary table and index. So in this case, this method "Tests 
if the content of two files is identical" -- I've phrased it with words 
as Alan suggested.


I've also updated the description to hopefully make it clear that this 
method extends the existing isSameFile method, and that the process 
builds on top of that operation.


For the impl, it was indeed a quick impl with small files in mind. But 
a memory-conscious impl is always right, so I've changed it to read a 
chunk at a time.
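
The updated webrev is not reproduced in this archive, so the following is only a sketch of what a chunk-at-a-time version might look like, not the actual patch. It assumes the isSameFile shortcut from the proposed specification, adds a size check as an early exit, and then reads both files in lock step. The class name SameContent and the 8 KiB buffer size are arbitrary choices for the example.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

/** Hypothetical stand-in for the proposed Files.isSameContent method. */
final class SameContent {

    static boolean isSameContent(Path path, Path path2) throws IOException {
        if (Files.isSameFile(path, path2)) {
            return true; // same file: no need to read anything
        }
        if (Files.size(path) != Files.size(path2)) {
            return false; // different sizes cannot have identical content
        }
        try (InputStream in1 = Files.newInputStream(path);
             InputStream in2 = Files.newInputStream(path2)) {
            byte[] buf1 = new byte[8192];
            byte[] buf2 = new byte[8192];
            while (true) {
                // readNBytes fills the buffer unless end of stream is reached
                int n1 = in1.readNBytes(buf1, 0, buf1.length);
                int n2 = in2.readNBytes(buf2, 0, buf2.length);
                if (n1 != n2 || !Arrays.equals(buf1, 0, n1, buf2, 0, n2)) {
                    return false; // stops at the first differing chunk
                }
                if (n1 < buf1.length) {
                    return true; // both streams exhausted without a difference
                }
            }
        }
    }
}
```

Memory use stays at two fixed buffers regardless of file size, and a difference in the first byte is detected after reading a single chunk from each file.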


Best,
Joe


On 4/27/2018 3:58 AM, Daniel Fuchs wrote:

Hi Joe,

On the specification side, I think I would reword the API
documentation to first explain how the method checks the
content of the two files.

The fact that it doesn't check the actual content if
the two files are 'the same' is kind of an optimization.

So I would suggest inverting the order of the two paragraphs
in the documentation, and combining them into one - something like:

  * This method first calls {@link #isSameFile(java.nio.file.Path,
  * java.nio.file.Path) isSameFile(path, path2)} to determine whether
  * the two files are the same. If {@code isSameFile(path, path2)}
  * returns false, this method will proceed to read the files and
  * compare them byte by byte to determine if they contain the same
  * contents. Otherwise, this method will return true without further
  * processing.


On the implementation side I don't think it's reasonable to call
readAllBytes() and hold the content of the two files in memory
for comparing their content, especially if it's to discover that
the first byte differs.

Some lock-step reading of the two files would seem more appropriate.

best regards,

-- daniel




On 27/04/2018 05:51, Joe Wang wrote:

Hi,

Considering extending isSameFile to add isSameContent to Files. 
Please review.


JBS: https://bugs.openjdk.java.net/browse/JDK-8202285

webrev: http://cr.openjdk.java.net/~joehw/jdk11/8202285/webrev/

specdiff: 
http://cr.openjdk.java.net/~joehw/jdk11/8202285/specdiff/java/nio/file/Files.html 



Thanks,
Joe







Re: RFR(JDK11/NIO) 8202285: (fs) Add a method to Files for comparing file contents

2018-04-30 Thread Joe Wang

Hi Bernd,

Thanks for the review. Please refer to the next email in this thread, 
I've changed it to read a chunk at a time instead.


Best,
Joe

On 4/27/2018 2:32 AM, Bernd Eckenfels wrote:

If this really stays this way and reads all bytes into memory, it should at 
least state so, as this can easily overflow the heap. Besides, the Javadoc is 
pretty specific but fails to mention the size comparison.

Greetings
Bernd

--
http://bernd.eckenfels.net

From: core-libs-dev  on behalf of Joe Wang 

Sent: Friday, April 27, 2018 6:51:08 AM
To: nio-dev; core-libs-dev
Subject: RFR(JDK11/NIO) 8202285: (fs) Add a method to Files for comparing file 
contents

Hi,

Considering extending isSameFile to add isSameContent to Files. Please
review.

JBS: https://bugs.openjdk.java.net/browse/JDK-8202285

webrev: http://cr.openjdk.java.net/~joehw/jdk11/8202285/webrev/

specdiff:
http://cr.openjdk.java.net/~joehw/jdk11/8202285/specdiff/java/nio/file/Files.html

Thanks,
Joe





Hashing files/bytes Re: RFR(JDK11/NIO) 8202285: (fs) Add a method to Files for comparing file contents

2018-04-30 Thread Paul Sandoz
Thanks, better than I expected with the transferTo method we recently added, 
but I think we could do even better for the ease-of-use case of "give me the 
hash of this file's contents or these bytes or this byte buffer".

Paul.

> On Apr 30, 2018, at 3:23 PM, Remi Forax  wrote:
> 
>> 
>> To Remi’s point this might dissuade/guide developers from using this method
>> when there are other more efficient techniques available when operating at
>> larger scales. However, it is unfortunately harder than it should be in Java
>> to hash the contents of a file, a byte[] or ByteBuffer, according to some
>> chosen algorithm (or a good default).
> 
> it's 6 lines of code
> 
>  var digest = MessageDigest.getInstance("SHA1");
>  try (var input = Files.newInputStream(Path.of("myfile.txt"));
>       var output = new DigestOutputStream(OutputStream.nullOutputStream(), digest)) {
>    input.transferTo(output);
>  }
>  var hash = digest.digest();
> 
> or 3 lines if you don't mind loading the whole file into memory
> 
>  var digest = MessageDigest.getInstance("SHA1");
>  digest.update(Files.readAllBytes(Path.of("myfile.txt")));
>  var hash = digest.digest();
> 
>> 
>> Paul.
> 
> Rémi



Re: RFR(JDK11/NIO) 8202285: (fs) Add a method to Files for comparing file contents

2018-04-30 Thread Remi Forax
----- Original Message -----
> From: "Paul Sandoz" 
> To: "Alan Bateman" 
> Cc: "nio-dev" , "core-libs-dev" 
> 
> Sent: Monday, April 30, 2018 20:47:06
> Subject: Re: RFR(JDK11/NIO) 8202285: (fs) Add a method to Files for comparing 
> file contents

>> On Apr 27, 2018, at 4:30 AM, Alan Bateman  wrote:
>> 
>> On 27/04/2018 05:51, Joe Wang wrote:
>>> Hi,
>>> 
>>> Considering extending isSameFile to add isSameContent to Files. Please 
>>> review.
>>> 
>>> JBS: https://bugs.openjdk.java.net/browse/JDK-8202285
>>> 
>>> webrev: http://cr.openjdk.java.net/~joehw/jdk11/8202285/webrev/
>>> 
>>> specdiff:
>>> http://cr.openjdk.java.net/~joehw/jdk11/8202285/specdiff/java/nio/file/Files.html
>> I assume we should ignore the implementation for now as the eventual
>> implementation won't use readAllBytes (at least not for large files).
>> 
> 
> Yes, as long as we don’t forget to follow up on a replacement (using memory
> mapped files say).
> 
> 
>> The existing isSameFile is specified as "Tests if two paths locate the same
>> file" and it would be good if the new method could be somewhat consistent
>> with that, e.g. "Tests if the content of two files is identical".
>> 
>> Specifying that the method always returns true for two paths that locate the
>> same file is reasonable. This could be made clearer by saying that the method
>> always returns true when path and path2 are equal, even if the file does not
>> exist.
>> 
>> The @return should say that it returns true if path and path2 locate the same
>> file or the content of both files is identical.
>> 
>> The javadoc for SecurityException has "to the file", I assume this should be
>> "to both files".
>> 
> 
> We might also want to say the contents of the two files are assumed to be held
> constant during the operation.
> 
> —
> 
> It’s tempting (well to me at least) to generalize to a mismatch method (like
> for arrays) returning the mismatching location in bytes, then you can
> determine if one file is a prefix of another given the file sizes.
> Bounds-accepting methods would also be useful to mismatch on partial content
> (including within the same file). If you use memory mapped files we can use
> direct byte buffers to efficiently perform the mismatch.

I'm not sure memory mapping is a good idea: Windows is notoriously bad at 
memory mapping small files, and if the files are big, see your own comment below.
But an implementation that reads byte buffers and compares them will be more 
efficient.

> 
> To Remi’s point this might dissuade/guide developers from using this method
> when there are other more efficient techniques available when operating at
> larger scales. However, it is unfortunately harder than it should be in Java
> to hash the contents of a file, a byte[] or ByteBuffer, according to some
> chosen algorithm (or a good default).

it's 6 lines of code

  var digest = MessageDigest.getInstance("SHA1");
  try (var input = Files.newInputStream(Path.of("myfile.txt"));
       var output = new DigestOutputStream(OutputStream.nullOutputStream(), digest)) {
    input.transferTo(output);
  }
  var hash = digest.digest();

or 3 lines if you don't mind loading the whole file into memory

  var digest = MessageDigest.getInstance("SHA1");
  digest.update(Files.readAllBytes(Path.of("myfile.txt")));
  var hash = digest.digest();

> 
> Paul.

Rémi


Re: RFR(JDK11/NIO) 8202285: (fs) Add a method to Files for comparing file contents

2018-04-30 Thread Paul Sandoz


> On Apr 27, 2018, at 4:30 AM, Alan Bateman  wrote:
> 
> On 27/04/2018 05:51, Joe Wang wrote:
>> Hi,
>> 
>> Considering extending isSameFile to add isSameContent to Files. Please 
>> review.
>> 
>> JBS: https://bugs.openjdk.java.net/browse/JDK-8202285
>> 
>> webrev: http://cr.openjdk.java.net/~joehw/jdk11/8202285/webrev/
>> 
>> specdiff: 
>> http://cr.openjdk.java.net/~joehw/jdk11/8202285/specdiff/java/nio/file/Files.html
> I assume we should ignore the implementation for now as the eventual 
> implementation won't use readAllBytes (at least not for large files).
> 

Yes, as long as we don’t forget to follow up on a replacement (using memory 
mapped files say).


> The existing isSameFile is specified as "Tests if two paths locate the same 
> file" and it would be good if the new method could be somewhat consistent 
> with that, e.g. "Tests if the content of two files is identical".
> 
> Specifying that the method always returns true for two paths that locate the 
> same file is reasonable. This could be made clearer by saying that the method 
> always returns true when path and path2 are equal, even if the file does not 
> exist.
> 
> The @return should say that it returns true if path and path2 locate the same 
> file or the content of both files is identical.
> 
> The javadoc for SecurityException has "to the file", I assume this should be 
> "to both files".
> 

We might also want to say the contents of the two files are assumed to be held 
constant during the operation.

—

It’s tempting (well to me at least) to generalize to a mismatch method (like 
for arrays) returning the mismatching location in bytes, then you can determine 
if one file is a prefix of another given the file sizes. Bounds-accepting 
methods would also be useful to mismatch on partial content (including within 
the same file). If you use memory mapped files we can use direct byte buffers 
to efficiently perform the mismatch.
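
As a rough illustration of that idea, the sketch below maps both files window by window and relies on ByteBuffer.mismatch (available from JDK 11). The class name FileMismatch, the method name mismatch and the 1 MiB window size are assumptions for the example, not part of the proposal under review.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Hypothetical helper illustrating a file-level mismatch. */
final class FileMismatch {
    private static final long WINDOW = 1L << 20; // assumed 1 MiB mapping window

    /** Returns the position of the first differing byte, or -1 if identical. */
    static long mismatch(Path p1, Path p2) throws IOException {
        try (FileChannel c1 = FileChannel.open(p1, StandardOpenOption.READ);
             FileChannel c2 = FileChannel.open(p2, StandardOpenOption.READ)) {
            long size1 = c1.size();
            long size2 = c2.size();
            long common = Math.min(size1, size2);
            for (long pos = 0; pos < common; pos += WINDOW) {
                long len = Math.min(WINDOW, common - pos);
                ByteBuffer b1 = c1.map(FileChannel.MapMode.READ_ONLY, pos, len);
                ByteBuffer b2 = c2.map(FileChannel.MapMode.READ_ONLY, pos, len);
                int i = b1.mismatch(b2); // relative index, or -1 if equal
                if (i >= 0) return pos + i;
            }
            // One file is a prefix of the other: report the shorter length.
            return size1 == size2 ? -1 : common;
        }
    }
}
```

A result equal to the smaller file size means the shorter file is a prefix of the longer one. JDK 12 and later ship Files.mismatch(Path, Path) with the same return convention. Reading into heap buffers instead of mapping is an equally valid variant of this sketch.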

To Remi’s point this might dissuade/guide developers from using this method 
when there are other more efficient techniques available when operating at 
larger scales. However, it is unfortunately harder than it should be in Java to 
hash the contents of a file, a byte[] or ByteBuffer, according to some chosen 
algorithm (or a good default).

Paul.

Re: RFR(JDK11/NIO) 8202285: (fs) Add a method to Files for comparing file contents

2018-04-27 Thread Remi Forax
This seems to promote the wrong way to do such a thing. The usual use case is 
that you want to compare the content of a well-known file with the content of a 
bunch of other files, so hashing is better.
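
A minimal sketch of that use case: hash the reference file once, then compare each candidate's hash against it. The class and method names (HashCompare, sha256, matching) are made up for the example, and SHA-256 is an arbitrary choice of algorithm here.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: compare one reference file against many by hash. */
final class HashCompare {
    /** Streams the file through SHA-256 without loading it into memory. */
    static byte[] sha256(Path file) throws IOException, NoSuchAlgorithmException {
        MessageDigest digest = MessageDigest.getInstance("SHA-256");
        try (InputStream in = Files.newInputStream(file)) {
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                digest.update(buf, 0, n);
            }
        }
        return digest.digest();
    }

    /** Returns the candidates whose hash equals that of the reference file. */
    static List<Path> matching(Path reference, List<Path> candidates)
            throws IOException, NoSuchAlgorithmException {
        byte[] expected = sha256(reference); // computed once
        List<Path> result = new ArrayList<>();
        for (Path candidate : candidates) {
            if (MessageDigest.isEqual(expected, sha256(candidate))) {
                result.add(candidate);
            }
        }
        return result;
    }
}
```

Equal hashes indicate equal content only up to the collision resistance of the chosen algorithm, so a byte-by-byte check can still follow a hash match when certainty is required.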

Rémi

----- Original Message -----
> From: "Joe Wang" 
> To: "nio-dev" , "core-libs-dev" 
> 
> Sent: Friday, April 27, 2018 06:51:08
> Subject: RFR(JDK11/NIO) 8202285: (fs) Add a method to Files for comparing file 
> contents

> Hi,
> 
> Considering extending isSameFile to add isSameContent to Files. Please
> review.
> 
> JBS: https://bugs.openjdk.java.net/browse/JDK-8202285
> 
> webrev: http://cr.openjdk.java.net/~joehw/jdk11/8202285/webrev/
> 
> specdiff:
> http://cr.openjdk.java.net/~joehw/jdk11/8202285/specdiff/java/nio/file/Files.html
> 
> Thanks,
> Joe


Re: RFR(JDK11/NIO) 8202285: (fs) Add a method to Files for comparing file contents

2018-04-27 Thread Alan Bateman

On 27/04/2018 05:51, Joe Wang wrote:

Hi,

Considering extending isSameFile to add isSameContent to Files. Please 
review.


JBS: https://bugs.openjdk.java.net/browse/JDK-8202285

webrev: http://cr.openjdk.java.net/~joehw/jdk11/8202285/webrev/

specdiff: 
http://cr.openjdk.java.net/~joehw/jdk11/8202285/specdiff/java/nio/file/Files.html
I assume we should ignore the implementation for now as the eventual 
implementation won't use readAllBytes (at least not for large files).


The existing isSameFile is specified as "Tests if two paths locate the 
same file" and it would be good if the new method could be somewhat 
consistent with that, e.g. "Tests if the content of two files is identical".


Specifying that the method always returns true for two paths that locate 
the same file is reasonable. This could be made clearer by saying that the 
method always returns true when path and path2 are equal, even if the file 
does not exist.


The @return should say that it returns true if path and path2 locate the 
same file or the content of both files is identical.


The javadoc for SecurityException has "to the file", I assume this 
should be "to both files".


-Alan



Re: RFR(JDK11/NIO) 8202285: (fs) Add a method to Files for comparing file contents

2018-04-27 Thread Jonathan Bluett-Duncan
Hi Joe,

I wonder if the method `isSameContent` should be named `haveSameContents`
so as to read more fluently in English.

Cheers,
Jonathan

On 27 April 2018 at 11:58, Daniel Fuchs  wrote:

> Hi Joe,
>
> On the specification side, I think I would reword the API
> documentation to first explain how the method checks the
> content of the two files.
>
> The fact that it doesn't check the actual content if
> the two files are 'the same' is kind of an optimization.
>
> So I would suggest inverting the order of the two paragraphs
> in the documentation, and combining them into one - something like:
>
>   * This method first calls {@link #isSameFile(java.nio.file.Path,
>   * java.nio.file.Path) isSameFile(path, path2)} to determine whether the
>   * two files are the same.
>   * If {@code isSameFile(path, path2)} returns false, this method will
>   * proceed to read the files and compare them byte by byte to determine
>   * if they contain the same contents.
>   * Otherwise, this method will return true without further
>   * processing.
>
>
> On the implementation side I don't think it's reasonable to call
> readAllBytes() and hold the content of the two files in memory
> for comparing their content, especially if it's to discover that
> the first byte differs.
>
> Some lock-step reading of the two files would seem more appropriate.
>
> best regards,
>
> -- daniel
>
>
>
>
>
> On 27/04/2018 05:51, Joe Wang wrote:
>
>> Hi,
>>
>> Considering extending isSameFile to add isSameContent to Files. Please
>> review.
>>
>> JBS: https://bugs.openjdk.java.net/browse/JDK-8202285
>>
>> webrev: http://cr.openjdk.java.net/~joehw/jdk11/8202285/webrev/
>>
>> specdiff: http://cr.openjdk.java.net/~joehw/jdk11/8202285/specdiff/java/nio/file/Files.html
>>
>> Thanks,
>> Joe
>>
>>
>


Re: RFR(JDK11/NIO) 8202285: (fs) Add a method to Files for comparing file contents

2018-04-27 Thread Daniel Fuchs

Hi Joe,

On the specification side, I think I would reword the API
documentation to first explain how the method checks the
content of the two files.

The fact that it doesn't check the actual content if
the two files are 'the same' is kind of an optimization.

So I would suggest inverting the order of the two paragraphs
in the documentation, and combining them into one - something like:

  * This method first calls {@link #isSameFile(java.nio.file.Path,
  * java.nio.file.Path) isSameFile(path, path2)} to determine whether the
  * two files are the same.
  * If {@code isSameFile(path, path2)} returns false, this method will
  * proceed to read the files and compare them byte by byte to determine
  * if they contain the same contents.
  * Otherwise, this method will return true without further
  * processing.


On the implementation side I don't think it's reasonable to call
readAllBytes() and hold the content of the two files in memory
for comparing their content, especially if it's to discover that
the first byte differs.

Some lock-step reading of the two files would seem more appropriate.

best regards,

-- daniel




On 27/04/2018 05:51, Joe Wang wrote:

Hi,

Considering extending isSameFile to add isSameContent to Files. Please 
review.


JBS: https://bugs.openjdk.java.net/browse/JDK-8202285

webrev: http://cr.openjdk.java.net/~joehw/jdk11/8202285/webrev/

specdiff: 
http://cr.openjdk.java.net/~joehw/jdk11/8202285/specdiff/java/nio/file/Files.html 



Thanks,
Joe





Re: RFR(JDK11/NIO) 8202285: (fs) Add a method to Files for comparing file contents

2018-04-27 Thread Bernd Eckenfels
If this really stays this way and reads all bytes into memory, it should at 
least state so, as this can easily overflow the heap. Besides, the Javadoc is 
pretty specific but fails to mention the size comparison.

Greetings
Bernd

--
http://bernd.eckenfels.net

From: core-libs-dev  on behalf of Joe 
Wang 
Sent: Friday, April 27, 2018 6:51:08 AM
To: nio-dev; core-libs-dev
Subject: RFR(JDK11/NIO) 8202285: (fs) Add a method to Files for comparing file 
contents

Hi,

Considering extending isSameFile to add isSameContent to Files. Please
review.

JBS: https://bugs.openjdk.java.net/browse/JDK-8202285

webrev: http://cr.openjdk.java.net/~joehw/jdk11/8202285/webrev/

specdiff:
http://cr.openjdk.java.net/~joehw/jdk11/8202285/specdiff/java/nio/file/Files.html

Thanks,
Joe



RFR(JDK11/NIO) 8202285: (fs) Add a method to Files for comparing file contents

2018-04-26 Thread Joe Wang

Hi,

Considering extending isSameFile to add isSameContent to Files. Please 
review.


JBS: https://bugs.openjdk.java.net/browse/JDK-8202285

webrev: http://cr.openjdk.java.net/~joehw/jdk11/8202285/webrev/

specdiff: 
http://cr.openjdk.java.net/~joehw/jdk11/8202285/specdiff/java/nio/file/Files.html


Thanks,
Joe