On 4/26/23, David Wright <deb...@lionunicorn.co.uk> wrote:
> I guess you need the expense of sha256 rather than md5 as you're
> downloading the entire web?

I am not downloading the entire web. I have no way of knowing how they
came up with those ideas, but I think we could use their estimate that
approximately one and a half million books have ever been published.
Think of it! It is not that much data. It would all fit nicely on one
hard drive, including some searching capability, and "bye bye google"
will be the name of your movie.

On 4/26/23, Dan Ritter <d...@randomstring.org> wrote:
> The only characters used in the sha256 hash itself are [a-f] and
> [0-9]

Yes, I knew that; that is why I could not understand why sha256sum was
being "courteous" to me.

On 4/26/23, Nicolas George <geo...@nsup.org> wrote:
> shaXsum always writes X/4 hexadecimal nibbles then two spaces then the
> file name. If the input is from stdin, then the convention is the file
> name is ‘-’.
>
> (Well, not always always: if the file name contains very special
> characters, it will use an escaped output format. And there is the -z
> option.)

On 4/26/23, Thomas Schmitt <scdbac...@gmx.net> wrote:
> "FILE" is the minus-sign for standard input. The second blank is there
> to indicate the text mode of sha256sum.
> Only the first blank is somewhat puzzling. But it's always there.
>
> https://www.gnu.org/software/coreutils/manual/html_node/sha2-utilities#sha2-utilities
> points to
>
> https://www.gnu.org/software/coreutils/manual/html_node/md5sum-invocation.html
> which says
> For each file, ‘md5sum’ outputs by default, the MD5 checksum, a space,
> a flag indicating binary or text input mode, and the file name. Binary
> mode is indicated with ‘*’, text mode with ‘ ’ (space). Binary mode is
> the default on systems where it’s significant, otherwise text mode is
> the default. The cksum command always uses binary mode and a ‘ ’
> (space) flag.
>
> So the first blank can be relied on and thus the proposal by Andy Smith
> to use "awk '{print $1}'" is valid.

OK, now I see why cutting the string off at the first space that
appears is safe.
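
For reference, the output format described in the quotes above can be
sketched in Python. This is only an illustration under stated
assumptions, not anyone's actual script: the helper names
parse_sum_line and sum_line_for_stdin are made up here, and the escaped
output format is handled only to the extent of dropping the leading
backslash (the file name is not unescaped).

```python
import hashlib

HEX_DIGITS = set("0123456789abcdef")


def parse_sum_line(line):
    """Split one sha256sum-style line into (digest, mode flag, file name).

    Layout, per the quoted manual: digest, one space, a mode flag
    (' ' for text, '*' for binary), then the file name.
    """
    line = line.rstrip("\n")
    # Escaped output format: a leading backslash marks an escaped name.
    # Assumption: we only drop the marker and leave the name as printed.
    if line.startswith("\\"):
        line = line[1:]
    # Cutting at the first space is safe: the digest is pure hex.
    digest, sep, rest = line.partition(" ")
    if not sep or len(rest) < 2:
        raise ValueError("not a checksum line: %r" % line)
    if len(digest) != 64 or not set(digest) <= HEX_DIGITS:
        raise ValueError("not a sha256 hex digest: %r" % digest)
    flag, name = rest[0], rest[1:]
    return digest, flag, name


def sum_line_for_stdin(data):
    """Build the line sha256sum would print for data read from stdin."""
    return hashlib.sha256(data).hexdigest() + "  -"


if __name__ == "__main__":
    print(parse_sum_line(sum_line_for_stdin(b"some string")))
```

Taking only the first field of the returned tuple corresponds to the
"awk '{print $1}'" approach from the quote.
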
I never saw such cases because I always used the sha*sum tools on
files. I would have expected that when a user feeds in a string via
printf, that was all there was to it. Of course, the sha*sum tools can
tell a file apart from a string of plain text.

On 4/26/23, Jeffrey Walton <noloa...@gmail.com> wrote:
> There's no guarantee a URL will map onto a filesystem.
> I seem to
> recall Stunnel tried to do that in a caching mode, but it had weird
> corner cases. (In addition to problems with filesystems that had
> character set and path limitations).

Well, no; and I am fine with:
 a) trying to match both, following the URL path as closely as possible
 b) the extra juggling of base64-ing and hashing the name of the file
 ...

Something I have learned as a corpora research kind of guy is never to
try to "educate" people. I just take their sh!t as they dump it,
cleanse it and deal with it! You would not hear the end of it if I
started telling stories of the kind of cr@p you find out there when you
look at the web from that point of view: from folks at archive.org who
would list:
 "Henry Valentine Miller", "Henry V. Miller", "Henry Miller", "henry
 miller", "Miller, Henry", "Miller, Henry 12-1891 06-1980"
apparently as different authors/"creators", to the gutenberg.org large
text bank including some protagonistic bs in the actual texts, to
developers of libreoffice watermarking text with some cr@p which of
course is being used for "monitoring" purposes by the kinds of folks
who put "intelligence" in the names of the organizations they work for
and who, to make sure they are making sense, put flags around
themselves when they fart through their mouths whatever nonsense they
think of.

I had been rehearsing daydreams about becoming dictator of the world
;-) and making people do "the right thing" (tm) ... until I once had an
epiphany while watching Trump talk to a media presstitute who
characteristically wasn't making much sense.
After he had asked a few questions trying to make sense of what she was
saying, the presstitute said "let me formulate it better". Trump
quietly sat back saying: "OK, take your time"!!! I was amazed! There
you have someone towards whom the U.S. media, as a mouthpiece of the
status quo, were being viscerally offensive, along with anything
relating to him, including posting on the front pages of mainstream US
newspapers naked pictures of his wife and mother of his child one month
before she became "the first lady", and yet he went easy on her,
respectfully! That was the best case I have noticed so far of
"separating the message from the messenger". I mean, people who erect
all those paywalls and somehow see themselves as authoring and guarding
content are not even the messengers, and we all have to put up with
their bs.

> I think your best bet is to digest the URL into a representation. I
> suggest using SipHash+Base64 or Base64URL. SipHash provides collision
> resistance, a uniform distribution, and its fast. SipHash has a very
> good pedigree since it was designed by Jean-Philippe Aumasson and
> Daniel J. Bernstein. The final Base64 or Base64URL encoding ensures
> you stay within printable character range without reserved file system
> characters.

Thank you. I will look into what they did when I get a chance,
lbrtchx
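
A minimal sketch of the digest-then-encode idea from the last quote,
under loudly stated assumptions: it swaps in SHA-256 (the hash this
thread started with) for SipHash, because Python's hashlib does not
offer SipHash, and the function name url_to_filename is hypothetical.
It only shows the shape of the approach: hash the URL, then Base64URL
the digest so that the name contains no "/" or other reserved
characters.

```python
import base64
import hashlib


def url_to_filename(url):
    """Map a URL to a fixed-length, filesystem-friendly name.

    Assumption: SHA-256 stands in for the SipHash suggested in the
    quote; any digest with a good distribution fits the same pattern.
    """
    digest = hashlib.sha256(url.encode("utf-8")).digest()
    # Base64URL uses '-' and '_' instead of '+' and '/', so no path
    # separator can appear; the '=' padding is dropped as well.
    encoded = base64.urlsafe_b64encode(digest)
    return encoded.rstrip(b"=").decode("ascii")


if __name__ == "__main__":
    print(url_to_filename("https://www.gnu.org/software/coreutils/"))
```

Two caveats on this design choice: Base64URL names are case-sensitive,
so on a case-insensitive filesystem a hex encoding may be the safer
pick, and a name can start with "-", which some command-line tools
would take for an option.
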