On 4/26/23, David Wright <deb...@lionunicorn.co.uk> wrote:
> I guess you need the expense of sha256 rather than md5 as you're
> downloading the entire web?

I am not downloading the entire web. I have no way of knowing how they
entertained those ideations, but I think we could use their estimate,
when Google said that approximately one and a half million books have
ever been published. Think of it! It is not that much data. It would all
fit nicely on one hard drive; add some search capability and "bye bye
google" will be the name of your movie.

At times you need to gain a sense of things before going into exposed
mode to search for something (which these days means making sure you are
not being baited into something else).

On 4/26/23, Dan Ritter <d...@randomstring.org> wrote:
> The only characters used in the sha256 hash itself are [a-f] and
> [0-9]

Yes, I knew that; that is why I could not understand why sha256sum was
being "courteous" to me.

On 4/26/23, Nicolas George <geo...@nsup.org> wrote:
> shaXsum always writes X/4 hexadecimal nibbles then two spaces then the
> file name. If the input is from stdin, then the convention is the file
> name is ‘-’.
>
> (Well, not always always: if the file name contains very special
> characters, it will use an escaped output format. And there is the -z
> option.)

On 4/26/23, Thomas Schmitt <scdbac...@gmx.net> wrote:
> "FILE" is the minus-sign for standard input. The second blank is there
> to indicate the text mode of sha256sum.
> Only the first blank is somewhat puzzling. But it's always there.
>
>
> https://www.gnu.org/software/coreutils/manual/html_node/sha2-utilities#sha2-utilities
> points to
>
> https://www.gnu.org/software/coreutils/manual/html_node/md5sum-invocation.html
> which says
> For each file, ‘md5sum’ outputs by default, the MD5 checksum, a space,
> a flag indicating binary or text input mode, and the file name. Binary
> mode is indicated with ‘*’, text mode with ‘ ’ (space). Binary mode is
> the default on systems where it’s significant, otherwise text mode is
> the default. The cksum command always uses binary mode and a ‘ ’
> (space) flag.
>
> So the first blank can be relied on and thus the proposal by Andy Smith
> to use "awk '{print $1}'" is valid.

OK, now I see why cutting off the string at the first space that appears
is safe. I never saw such cases because I always used the sha*sum tools
on files. I would have expected that if a user enters a string via
printf, that was all there was to it. Of course, the sha*sum tools can
tell a file apart from a text string.

On 4/26/23, Jeffrey Walton <noloa...@gmail.com> wrote:
> There's no guarantee a URL will map onto a filesystem.

> I seem to
> recall Stunnel tried to do that in a caching mode, but it had weird
> corner cases. (In addition to problems with filesystems that had
> character set and path limitations).

Well, no; and I am fine with:
 a) trying to match both, following the URL path as closely as possible
 b) the extra juggling of base64-ing and hashing the name of the file ...

Something I have learned as a corpora research kind of guy is never to
try to "educate" people. I just take their sh!t as they dump it, cleanse
it and deal with it!

You would not hear the end of it if I started telling stories of the
kind of cr@p you find out there when you look at the web from that point
of view.

> I think your best bet is to digest the URL into a representation. I
> suggest using SipHash+Base64 or Base64URL. SipHash provides collision
> resistance, a uniform distribution, and its fast. SipHash has a very
> good pedigree since it was designed by Jean-Philippe Aumasson and
> Daniel J. Bernstein. The final Base64 or Base64URL encoding ensures
> you stay within printable character range without reserved file system
> characters.

Thank you, I will look into what they did when I get a chance,

 lbrtchx

On 4/26/23, Albretch Mueller <lbrt...@gmail.com> wrote:
> On 4/26/23, David Wright <deb...@lionunicorn.co.uk> wrote:
>> I guess you need the expense of sha256 rather than md5 as you're
>> downloading the entire web?
>
> I am not downloading the entire web.
> I have no way of knowing how
> they entertained those ideations but I think we could use their
> estimate when they said that approximately 1 million and a half books
> have been ever published. Think of it! It is not that much data. It
> would all fit nicely in one hard drive include some searching
> capability and "bye bye google" will be the name of your movie.
>
> On 4/26/23, Dan Ritter <d...@randomstring.org> wrote:
>> The only characters used in the sha256 hash itself are [a-f] and
>> [0-9]
>
> Yes, I knew that; that is why I could not understand why sha256sum
> was being "courteous" to me.
>
> On 4/26/23, Nicolas George <geo...@nsup.org> wrote:
>> shaXsum always writes X/4 hexadecimal nibbles then two spaces then the
>> file name. If the input is from stdin, then the convention is the file
>> name is ‘-’.
>>
>> (Well, not always always: if the file name contains very special
>> characters, it will use an escaped output format. And there is the -z
>> option.)
>
> On 4/26/23, Thomas Schmitt <scdbac...@gmx.net> wrote:
>> "FILE" is the minus-sign for standard input. The second blank is there
>> to indicate the text mode of sha256sum.
>> Only the first blank is somewhat puzzling. But it's always there.
>>
>>
>> https://www.gnu.org/software/coreutils/manual/html_node/sha2-utilities#sha2-utilities
>> points to
>>
>> https://www.gnu.org/software/coreutils/manual/html_node/md5sum-invocation.html
>> which says
>> For each file, ‘md5sum’ outputs by default, the MD5 checksum, a space,
>> a flag indicating binary or text input mode, and the file name. Binary
>> mode is indicated with ‘*’, text mode with ‘ ’ (space). Binary mode is
>> the default on systems where it’s significant, otherwise text mode is
>> the default. The cksum command always uses binary mode and a ‘ ’
>> (space) flag.
>>
>> So the first blank can be relied on and thus the proposal by Andy Smith
>> to use "awk '{print $1}'" is valid.
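
For what it is worth, the output layout discussed in the quotes above
(hex digest, a space, a mode flag that is a space in text mode, then the
name, which is '-' for stdin) can be sketched in Python. This is only an
illustration under assumptions: the helper names sha256sum_line and
digest_field are made up here, the line is built with hashlib rather
than by running sha256sum, and the escaped output format mentioned by
Nicolas George is not handled.

```python
import hashlib


def sha256sum_line(data: bytes, name: str = "-") -> str:
    """Build a line shaped like default text-mode output:
    64 hex characters, one space, the mode flag (a space), then the name.
    Hypothetical helper; it imitates the layout, it does not call sha256sum."""
    return f"{hashlib.sha256(data).hexdigest()}  {name}"


def digest_field(line: str) -> str:
    """Rough equivalent of awk '{print $1}': the first whitespace-delimited field."""
    return line.split()[0]


# Hashing a string, as one would with printf piped into sha256sum.
line = sha256sum_line(b"https://www.gnu.org/")
print(line)                # digest, two blanks, then '-'
print(digest_field(line))  # digest only
```

Since the digest itself only contains [0-9a-f], splitting on the first
blank cannot cut the hash short.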
>
> OK, now I see why cutting off the string on the first space that
> appears is safe. I never saw such cases because I always used sha*sums
> on files. I would expect if a user enters a string via printf that was
> all there was to it. Of course, sha*sums can tell apart a file from a
> string of plain text.
>
> On 4/26/23, Jeffrey Walton <noloa...@gmail.com> wrote:
>> There's no guarantee a URL will map onto a filesystem.
>
>> I seem to
>> recall Stunnel tried to do that in a caching mode, but it had weird
>> corner cases. (In addition to problems with filesystems that had
>> character set and path limitations).
>
> Well, no; and I am fine with:
> a) trying to best match both; the URL path as best as possible
> b) the extra malabarism base64-ing and hashing the name of the file ...
>
> Something I have learned as a corpora research kind of guy is not to
> ever try to "educate" people. I would just take their sh!t as they
> dump it and cleanse, deal with it!
>
> You would not hear the end of it if I start telling stories of the
> kind of cr@p you find out there when you look at the web from that
> point of view: from folks at archive.org who would list: "Henry
> Valentine Miller", "Henry V. Miller", "Henry Miller", "henry miller",
> "Miller, Henry", "Miller, Henry 12-1891 06-1980" apparently as
> different authors/"creators", to the gutenberg.org large text bank
> including some protagonistic bs in the actual texts, to developers of
> libreoffice watermarking text with some cr@p which of course is being
> used for "monitoring" purposes by the kinds of folks who put
> "intelligence" in the names of the organizations they work for and to
> make sure they are making sense they put flags around them when they
> fart through their mouths whatever nonsense they think of.
>
> I had had rehearsing day dreams about becoming a dictator of the
> world ;-) and making people do "the right thing" (tm) ...
> until I had
> once an epiphany while watching Trump talk to a media prestitude who
> characteristically wasn't making much sense. After asking a few
> questions trying to make sense of what she was saying, prestitude said
> "let me formulate it better". Trump quietly sat back saying: "OK, take
> your time"!!!
>
> I was amazed! There you have someone the U.S. media, who as a
> mouthpiece of the status quo, were being viscerally offensive towards
> anything relating to him, including posting on the front page of
> mainstream US newspapers naked pictures of his wife and mother of his
> child one month before she became "the first lady" and he took it
> easy, respectfully on her! That was the best case I have noticed so
> far of "separating the message from the messenger". I mean people who
> erect all those pay walls and somehow see themselves as authoring,
> guarding content are not even the messengers and we all have to put up
> with their bs.
>
>> I think your best bet is to digest the URL into a representation. I
>> suggest using SipHash+Base64 or Base64URL. SipHash provides collision
>> resistance, a uniform distribution, and its fast. SipHash has a very
>> good pedigree since it was designed by Jean-Philippe Aumasson and
>> Daniel J. Bernstein. The final Base64 or Base64URL encoding ensures
>> you stay within printable character range without reserved file system
>> characters.
>
> Thank you I will look into what they did when I get a chance,
>
> lbrtchx
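
A minimal sketch of the "digest the URL, then Base64URL-encode it" idea
suggested above, so the result can serve as a file name. Assumptions,
stated loudly: Python's standard library does not expose SipHash, so
SHA-256 is swapped in here purely for illustration, and the function
name url_to_filename is made up. This is not what Stunnel or anyone in
this thread actually does.

```python
import base64
import hashlib


def url_to_filename(url: str) -> str:
    """Map a URL to a fixed-length, filesystem-safe name (hypothetical helper).

    SHA-256 stands in for the suggested SipHash. The Base64URL alphabet is
    [A-Za-z0-9_-], so the name contains no '/' and no '+'.
    """
    digest = hashlib.sha256(url.encode("utf-8")).digest()  # 32 raw bytes
    encoded = base64.urlsafe_b64encode(digest)             # 44 bytes, ends in '='
    return encoded.rstrip(b"=").decode("ascii")            # padding dropped


url = "https://www.gnu.org/software/coreutils/manual/html_node/md5sum-invocation.html"
print(url_to_filename(url))
```

The mapping is one-way, so the original URL has to be recorded somewhere
else (for instance in an index file next to the downloaded content) if
it needs to be recovered later.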