Re: sha256sum --text generating blank spaces and hyphens?
On 4/27/23, David Christensen wrote:
> Please see the OP, step (d).

> On 4/26/23, Albretch Mueller wrote:
>> a) encode the string name as base64
>> b) calculate the sha256sum of §a
>> c) use §b as file name (of course, leaving the original extension as it
>> is)
>> d) include a "§b_file_name.txt" plain text file descriptor which only
>> content is the actual prehash name of that file.

I do that because base64 would (must?) work on any OS, and the conversion to and from any other encoding is straightforward.

As you suggested, I am more friendly to the idea of including hashes of the data payload, even though I think it is not that important, because the actual big problem that corpora research people have is files with exactly the same look and feel and the same content which have different hashes (for example, PDF files). I have been thinking about a way to compute hashes that reflect more faithfully both the structural and the content similarity among files. Do you know of any way to do such a thing? The structural aspect should be "easy". It could be handled as DAGs of some sort of XPaths.

I was actually going to show you what I meant, but I was happy to see "I was wrong". I even waited to try it from some other access point. I have used this one-liner to show how google/youtube/NSA/"Vladimir Putin"/... was watermarking files for whatever reason, but it worked fine when I was trying to show it to you ;-)

_YT_URI=EngW7tLk6R8; _OFL="${_YT_URI}_"$(date +%Y%m%d%H%M%S)".mp4"; ./yt-dlp --verbose --format "mp4" --output "${_OFL}" -- "${_YT_URI}"; ls -l "${_OFL}"; file --brief "${_OFL}"; time sha256sum "${_OFL}"

-rwxrwxrwx 1 user user 828540 Aug 15 2022 EngW7tLk6R8_20230501185618.mp4
ISO Media, MP4 v2 [ISO 14496-14]
0b950b88667b5fec35f3dd54005c16e5e742c703a0c776ec6da11b60a4775ae6  EngW7tLk6R8_20230501185618.mp4

-rwxrwxrwx 1 user user 828540 Aug 15 2022 EngW7tLk6R8_20230501185657.mp4
ISO Media, MP4 v2 [ISO 14496-14]
0b950b88667b5fec35f3dd54005c16e5e742c703a0c776ec6da11b60a4775ae6  EngW7tLk6R8_20230501185657.mp4

Max Nikulin (12023-04-28):
> And you will quickly face servers that sends incorrectly Content-Type or
> intentionally put application/octet-stream with no sniff header to force
> browser to save the file instead of opening it e.g. in built-in PDF
> reader.

Even if not totally syntactic (so you can't functionally solve it with some code), this is a relatively manageable problem. You would:
a) take notice of the sites that do such things;
b) sniff not only the HTTP headers, but also notice the file extension of the file; and
c) save the file to a temp repository for the Linux util "file" to be run on it ...

Out of those heuristics you should be able to strategize around such problems.

lbrtchx
Re: sha256sum --text generating blank spaces and hyphens?
Max Nikulin (12023-04-29):
> > incorrect
> This word was stripped in the following quote as well.

I was being charitable in not pointing out the logical contradiction that if it is intentional then it is not incorrect, at least for somebody.

> Writing the cited phrase I had in mind an attack which target

You can send your movie-plot attacks to Bruce Schneier's next competition; as for me, I will not answer further.

-- 
Nicolas George
Re: sha256sum --text generating blank spaces and hyphens?
On 28/04/2023 23:42, Max Nikulin wrote:
> incorrect

This word was stripped in the following quote as well.

On 29/04/2023 15:50, Nicolas George wrote:
> Max Nikulin (12023-04-28):
>> value may be intentionally specified
>
> I am stripping your mail to just these few words, because they are the
> core flaw of your argument.

If your prefer to ignore other arguments, I am leaving it up to you. Source of Content-Type HTTP header values may be a simple file suffix map like

    types {
        text/html html;
        image/gif gif;
        image/jpeg jpg;
    }

http://nginx.org/en/docs/http/ngx_http_core_module.html#types

> If something has been done intentionally, overriding it with an
> heuristic is a very bad practice.

Writing the cited phrase I had in mind an attack which target is to pass an innocently looking file name to specific application usually used for another purpose.

> As for invalid values that are mistakenly specified, they are a
> minority, and basing your entire design on a minority of mistakes is
> also not a very good practice.

I consider it is important to notify user that something might go wrong and perhaps inconsistent data have been received. Even if it is a rare case, it should help to perform an appropriate action, to correct a mistake, to minimize damage.
Re: sha256sum --text generating blank spaces and hyphens?
Max Nikulin (12023-04-28):
> value may be intentionally specified

I am stripping your mail to just these few words, because they are the core flaw of your argument.

If something has been done intentionally, overriding it with an heuristic is a very bad practice.

As for invalid values that are mistakenly specified, they are a minority, and basing your entire design on a minority of mistakes is also not a very good practice.

-- 
Nicolas George
Re: sha256sum --text generating blank spaces and hyphens?
On 28/04/2023 15:06, Nicolas George wrote:
> Max Nikulin (12023-04-28):
>> So URI comparison is not a trivial task.
>
> It is an impossible task unless you have specific information about the
> workings of the website.

However some steps toward URL normalization should still be tried.

>> And you will quickly face servers that sends incorrectly Content-Type or
>> intentionally put application/octet-stream with no sniff header to force
>> browser to save the file instead of opening it e.g. in built-in PDF
>> reader.
>
> So what?

Usually I would trust libmagic/file(1) more than the content-type header. HTTP server may send header depending on file extension. Of course, there are cases when info provided by libmagic may be extended by Content-Type or file suffix (in URI path or download file name hint in HTTP headers): XPI browser extensions are ZIP files. Plain text file may contain markdown or reStructured text markup.

You regret absence of standard way to store file type, but incorrect value may be intentionally specified there. I consider heuristics unavoidable whether with standardized place or without it.
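A minimal sketch of the libmagic-versus-header comparison discussed in this message, assuming file(1) is installed. The function name check_type and its message wording are hypothetical, and reporting a disagreement rather than silently picking one value is a design choice of this sketch, not something the thread settles.

```shell
# Compare a declared Content-Type with what file(1) detects from the bytes.
# usage: check_type DECLARED_TYPE PATH
check_type() {
    declared=$1
    path=$2
    # -b: omit the file name; --mime-type: print only the type, e.g. text/plain
    detected=$(file -b --mime-type "$path")
    if [ "$declared" = "$detected" ]; then
        echo "match: $detected"
    else
        echo "mismatch: header says $declared, file(1) says $detected"
    fi
}
```

Called as `check_type application/octet-stream download.pdf`, it would flag the forced-download case mentioned above so that the archiver can log it and decide what to keep.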
Re: sha256sum --text generating blank spaces and hyphens?
Max Nikulin (12023-04-28):
> So URI comparison is not a trivial task.

It is an impossible task unless you have specific information about the workings of the website.

> And you will quickly face servers that sends incorrectly Content-Type or
> intentionally put application/octet-stream with no sniff header to force
> browser to save the file instead of opening it e.g. in built-in PDF reader.

So what?

-- 
Nicolas George
Re: sha256sum --text generating blank spaces and hyphens?
On 26/04/2023 21:33, Albretch Mueller wrote:
> a) the crazy long name
> b) its base64 representation
> c) §b's sha256sum representation which is the one used for the file
> name and the log of the download.

I see no point in base64 step since sha may be calculated for original URI directly. However an important step of URI normalization is missed:

- often http: and https: are alternatives
- domain name may contain unicode characters or be represented as pure ASCII punycode
- #anchors (sometimes empty #) at the end of URI usually does not change served content. It may be abused however by some web application to provide content dependent of anchors. Or a web page may hide parts of its content using CSS depending on the anchor. So its stripping may cause troubles.
- Session or user activity tracking query ("search") parameters that must be stripped for archival purposes
- Some parts of URI may be percent encoded keeping equivalence with "canonical" URI
- Web page may suggest "canonical" URL, but sometimes it is a misleading hint.

So URI comparison is not a trivial task.

Another point is that the same page may be saved multiple times, so URI hash is not enough for unique key.

On 26/04/2023 21:48, Nicolas George wrote:
> OTOH, HTTP does have a place to state the type of the file, and the
> extension in URLs is not reliable: if you want to do it properly, you
> must set your local file extension based on the Content-Type response
> header.

And you will quickly face servers that sends incorrectly Content-Type or intentionally put application/octet-stream with no sniff header to force browser to save the file instead of opening it e.g. in built-in PDF reader.
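A rough sketch of three of the normalization steps listed in this message (scheme, anchor, tracking parameters), assuming a sed that accepts -E. The function name normalize_url is hypothetical, and the parameter names utm_*, fbclid and gclid are only illustrative examples of tracking parameters. Punycode, percent-encoding and "canonical" link hints are deliberately not handled, and, as the message warns, stripping anchors or treating http: and https: as equal can be wrong for some sites.

```shell
# Heuristic URL normalization before hashing; not a complete URI canonicalizer.
# usage: normalize_url URL
normalize_url() {
    printf '%s\n' "$1" |
    sed -E \
        -e 's/#.*$//' \
        -e 's#^[Hh][Tt][Tt][Pp][Ss]?://#https://#' \
        -e 's/([?&])(utm_[^=&]*|fbclid|gclid)=[^&]*/\1/g' \
        -e 's/\?&+/?/' \
        -e 's/&&+/\&/g' \
        -e 's/[?&]$//'
}
```

The first expression drops the anchor, the second folds http: into https:, the third removes the listed tracking parameters, and the last three tidy up the separators left behind.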
Re: sha256sum --text generating blank spaces and hyphens?
On 4/27/23 01:04, Nicolas George wrote:
> David Christensen (12023-04-26):
>> My suggestion assumes that the URL => hash => content mapping is saved
>> somehow.
>
> That is an assumption that needed to be made explicit from the start.
>
>> For example, save the content in a file named after the hash and
>> save the URL in a file whose name is the hash plus a suffix. Finding a
>> document by URL then becomes a grep(1) invocation.
>
> This is not very efficient.

Please see the OP, step (d). You are free to propose better solutions.

On 4/26/23 21:02, David Christensen wrote:
> Things get more interesting when you approach the problem as a database.
> Save the content wherever and put the metadata into a table -- content
> hash (primary key), URL, download timestamp, author, subject, title,
> keywords, etc.. Create fully inverted indexes. Create a search engine.
> Create a spider. Implementation could range from a CSV/TSV flat-file
> and shell/P* scripts, to a desktop database/UI, to a LAMP stack, and
> beyond (NoSQL, N-tier). There are distributed file sharing systems
> based on such ideas.

David
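The simplest end of the range quoted above, a TSV flat file driven from the shell, could look roughly like this. It is only a sketch: the column layout (hash, URL, UTC timestamp) and the function names add_row and hashes_for_url are assumptions made for illustration, and URLs containing backslashes would need extra care because awk -v interprets escape sequences.

```shell
# Hypothetical flat-file index: hash<TAB>url<TAB>download timestamp
index=$(mktemp)

# usage: add_row HASH URL
add_row() {
    printf '%s\t%s\t%s\n' "$1" "$2" "$(date -u +%Y-%m-%dT%H:%M:%SZ)" >> "$index"
}

# Print every content hash recorded for a URL, oldest first.
# usage: hashes_for_url URL
hashes_for_url() {
    awk -F '\t' -v url="$1" '$2 == url { print $1 }' "$index"
}
```

One lookup reads a single file instead of grepping one descriptor per document, and because every download adds a row, the same URL saved at different times stays distinguishable.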
Re: sha256sum --text generating blank spaces and hyphens?
David Christensen (12023-04-26):
> My suggestion assumes that the URL => hash => content mapping is saved
> somehow.

That is an assumption that needed to be made explicit from the start.

> For example, save the content in a file named after the hash and
> save the URL in a file whose name is the hash plus a suffix. Finding a
> document by URL then becomes a grep(1) invocation.

This is not very efficient.

-- 
Nicolas George
Re: sha256sum --text generating blank spaces and hyphens?
On 4/27/23, Max Nikulin wrote:
> I have never tried: "Open-source self-hosted web archiving"
> https://github.com/ArchiveBox/ArchiveBox
>
> This one allows to save selected part of a page:
> https://github.com/danny0838/webscrapbook/

Thank you for keeping me busy! From their recommendations:

https://github.com/ArchiveBox/ArchiveBox/wiki/Web-Archiving-Community
https://en.wikipedia.org/wiki/List_of_Web_archiving_initiatives
https://coptr.digipres.org/index.php/Main_Page
~
However, what I have in mind is definitely more than archiving, which would be only the first phase of it.

// __ [Corpora-List] towards a "pan document format" (pun intended) . . .
https://list.elra.info/mailman3/hyperkitty/list/corp...@list.elra.info/message/4AULI3UUQ7BQG5ANFYGEEL7FXQXIILYN/
~
In particular, I am interested in a corpus of "universally appealing writers"

// __ list of authors and their work ...
https://list.elra.info/mailman3/hyperkitty/list/corp...@list.elra.info/thread/5PFZUBNLRWW2FDHDWHPKZYOMAGZLOWXG/#4BTSFS5OCUFWVWU4ZDSBJ765DQFWWI7B/
~
lbrtchx
Re: sha256sum --text generating blank spaces and hyphens?
On 27/04/2023 11:02, David Christensen wrote:
> Things get more interesting when you approach the problem as a database.
> Save the content wherever and put the metadata into a table -- content
> hash (primary key), URL, download timestamp, author, subject, title,
> keywords, etc.. Create fully inverted indexes. Create a search engine.
> Create a spider. Implementation could range from a CSV/TSV flat-file
> and shell/P* scripts, to a desktop database/UI, to a LAMP stack, and
> beyond (NoSQL, N-tier). There are distributed file sharing systems
> based on such ideas.

I have never tried: "Open-source self-hosted web archiving"
https://github.com/ArchiveBox/ArchiveBox

This one allows to save selected part of a page:
https://github.com/danny0838/webscrapbook/
Re: sha256sum --text generating blank spaces and hyphens?
On 4/26/23 16:21, Albretch Mueller wrote:
> On 4/26/23, David Christensen wrote:
>> I suggest hashing the document content rather than the URL. This would
>> work nicely for static documents.
>
> What do you mean by "hashing the document content"?

2023-04-26 21:03:08 dpchrist@taz ~
$ touch foo

2023-04-26 21:03:12 dpchrist@taz ~
$ sha256sum foo
e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855  foo

In this case, the content is an empty string and the hexadecimal encoding of the SHA256 hash is "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855".

> How would that help when what you are trying to do is cleanse and
> canonize texts as best as you could to find relationships among their
> text segments?
>
> lbrtchx

* Each unique text would be stored once regardless of how many URLs link to it.

* If the content at a URL changes, the new content will have a new hash. So, the new content will be saved and the old content will be preserved (instead of the new content overwriting the old content).

* With regard to my response to the post by Nicolas George, a database of metadata could benefit analysis regardless of the scheme used to name content files.

David
Re: sha256sum --text generating blank spaces and hyphens?
On 4/26/23 15:48, Nicolas George wrote:
> David Christensen (12023-04-26):
>> I suggest hashing the document content rather than the URL. This would
>> work nicely for static documents.
>
> That will be very convenient to retrieve the document content from the
> URL.

My suggestion assumes that the URL => hash => content mapping is saved somehow. For example, save the content in a file named after the hash and save the URL in a file whose name is the hash plus a suffix. Finding a document by URL then becomes a grep(1) invocation.

Things get more interesting when you approach the problem as a database. Save the content wherever and put the metadata into a table -- content hash (primary key), URL, download timestamp, author, subject, title, keywords, etc.. Create fully inverted indexes. Create a search engine. Create a spider. Implementation could range from a CSV/TSV flat-file and shell/P* scripts, to a desktop database/UI, to a LAMP stack, and beyond (NoSQL, N-tier). There are distributed file sharing systems based on such ideas.

David
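The "file named after the hash, URL in a file with a suffix" scheme from this message can be sketched as follows. The ".url" suffix, the temporary store directory and the function names store and lookup are hypothetical choices for illustration; the message itself names none of them.

```shell
store_dir=$(mktemp -d)

# Save a downloaded file under its content hash and record the URL beside it.
# usage: store URL FILE   (prints the content hash)
store() {
    hash=$(sha256sum < "$2" | cut -d' ' -f1)
    cp "$2" "$store_dir/$hash"
    printf '%s\n' "$1" >> "$store_dir/$hash.url"
    printf '%s\n' "$hash"
}

# Find the content hash(es) recorded for a URL: fixed string, whole-line match.
# usage: lookup URL
lookup() {
    grep -l -x -F -e "$1" "$store_dir"/*.url 2>/dev/null |
    sed -e 's#.*/##' -e 's#\.url$##'
}
```

Two URLs that serve identical bytes end up sharing one content file and one ".url" file listing both addresses, which is the "stored once" property; the price is the linear grep over all descriptors that the reply objects to.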
Re: sha256sum --text generating blank spaces and hyphens?
On 4/26/23, David Christensen wrote:
> I suggest hashing the document content rather than the URL. This would
> work nicely for static documents.

What do you mean by "hashing the document content"?

How would that help when what you are trying to do is cleanse and canonize texts as best as you could to find relationships among their text segments?

lbrtchx
Re: sha256sum --text generating blank spaces and hyphens?
David Christensen (12023-04-26):
> I suggest hashing the document content rather than the URL. This would work
> nicely for static documents.

That will be very convenient to retrieve the document content from the URL.

-- 
Nicolas George
Re: sha256sum --text generating blank spaces and hyphens?
On 4/26/23 00:41, Albretch Mueller wrote:
> This is not a debian question per se (more like a Linux bash one),
> but I wasn't able to find an answer on the Internet.
>
> Here is first the problem I am having before you start reading a
> conspiracy theory into it ;-)
>
> I need to somehow map URL on the web to a local file, but you can't
> do that for two main reasons:
>
> 1) URLs are free text
> 2) which people take to their heart's content.
>
> Take for example:
>
> https://dokumen.pub/qdownload/nietzsche-und-der-deutsche-geist-band-4-ausbreitung-und-wirkung-des-nietzscheschen-werkes-im-deutschen-sprachraum-bis-zum-ende-des-zweiten-weltkrieges-ein-schrifttumsverzeichnis-der-jahre-1867-1945-ergnzungen-berichtigungen-und-gesamtverzeichnisse-zu-den-bnden-i-iii-9783110202861-9783110189865-3110189860.html
>
> that file and the pdf you would download I need to map to a local
> directory looking like: ... /pub/dokumen/qdownload/ ...
>
> but the file name (excluding the extension) is 306 characters long,
> which Windows NTFS would not swallow. There may be also funky rules
> regarding character sets and where in a string certain chars may be
> used; so, as a way to work around those kinds of problems I:
>
> a) encode the string name as base64
> b) calculate the sha256sum of §a
> c) use §b as file name (of course, leaving the original extension as it is)
> d) include a "§b_file_name.txt" plain text file descriptor which only
> content is the actual prehash name of that file.
>
> https://dokumen.pub/qdownload/nietzsche-und-der-deutsche-geist-band-4-ausbreitung-und-wirkung-des-nietzscheschen-werkes-im-deutschen-sprachraum-bis-zum-ende-des-zweiten-weltkrieges-ein-schrifttumsverzeichnis-der-jahre-1867-1945-ergnzungen-berichtigungen-und-gesamtverzeichnisse-zu-den-bnden-i-iii-9783110202861-9783110189865-3110189860.html
>
> _TXT="nietzsche-und-der-deutsche-geist-band-4-ausbreitung-und-wirkung-des-nietzscheschen-werkes-im-deutschen-sprachraum-bis-zum-ende-des-zweiten-weltkrieges-ein-schrifttumsverzeichnis-der-jahre-1867-1945-ergnzungen-berichtigungen-und-gesamtverzeichnisse-zu-den-bnden-i-iii-9783110202861-9783110189865-3110189860"
> _B64TXTENC=$(printf '%s' "${_TXT}" | base64 )
> echo "// __ \$_B64TXTENC: |${_B64TXTENC}|"
> _B64TXTDEC=$(printf '%s' "${_B64TXTENC}" | base64 --decode)
> echo "// __ \$_B64TXTDEC: |${_B64TXTDEC}|"
> if [[ "${_TXT}" == "${_B64TXTDEC}" ]]; then
>   echo "// __ [[ \${_TXT} == \${_B64TXTDEC} ]]: |${_TXT}|"
>   _SHA256=$(printf '%s' "${_TXT}" | sha256sum --text )
>   echo "// __ \$_SHA256: |${_SHA256}|"
> fi
>
> // __ $_SHA256:
> |7d5895cb24ab49692a8ad495e036074fec8e61b22040544f02a9b69c926dbdeb  -|
>
> I am trying to avoid funky characters and sha256sum --text still
> generates them!?!
>
> I work like this because I need replicate the original URL as a local
> path in a way that would be compatible any file system.
>
> Do you know of a better way to deal with such issues?
>
> lbrtchx

I will assume you have solved the sha256sum output issue. (I would use Perl and Digest::SHA.)

I suggest hashing the document content rather than the URL. This would work nicely for static documents.

David
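For reference, steps (a) to (d) as described in the original post can be sketched like this, with the digest cut out of the sha256sum output. This is an illustration under stated assumptions, not the poster's script: the name, the extension and the output directory are placeholders, and the digest is computed over the base64 text, following step (b) as written.

```shell
name='some very long document name'   # placeholder for the 306-character name
ext='html'                            # original extension, kept as-is
out_dir=$(mktemp -d)                  # placeholder target directory

# (a) base64 of the name; tr removes the line wrapping base64 may add
b64=$(printf '%s' "$name" | base64 | tr -d '\n')

# (b) SHA-256 of (a), keeping only the 64 hex digits
sum=$(printf '%s' "$b64" | sha256sum | cut -d' ' -f1)

# (c) the digest becomes the local file name (empty stand-in for the download)
: > "$out_dir/$sum.$ext"

# (d) descriptor file holding the original, pre-hash name
printf '%s\n' "$name" > "$out_dir/${sum}_file_name.txt"

echo "$out_dir/$sum.$ext"
```

The resulting name is always 64 hexadecimal characters plus the extension, whatever the length or character set of the original.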
Re: sha256sum --text generating blank spaces and hyphens?
Hello,

On Wed, Apr 26, 2023 at 08:30:01PM +, Albretch Mueller wrote:
> On 4/26/23, Dan Ritter wrote:
> > The only characters used in the sha256 hash itself are [a-f] and
> > [0-9]
>
> Yes, I knew that; that is why I could not understand why sha256sum
> was being "courteous" to me.

The man page is very clear on what the output will be, and just running it on a few files should also make it obvious to you. Also, all the other sha*sum and md5sum utilities work pretty much the same way. So I don't know why this comes as a surprise, but OK.

> OK, now I see why cutting off the string on the first space that
> appears is safe. I never saw such cases because I always used sha*sums
> on files.

But it does the same thing with files…

Thanks,
Andy

-- 
https://bitfolk.com/ -- No-nonsense VPS hosting
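To make the point concrete, here is a small sketch of what sha256sum prints for standard input and how to keep only the digest. The helper name sha256_hex is hypothetical; the empty string is used as input because its SHA-256 digest is well known.

```shell
# Hash a string given as an argument and print only the 64 hex digits.
sha256_hex() {
    # printf '%s' avoids the trailing newline that echo would add to the input
    printf '%s' "$1" | sha256sum | awk '{print $1}'
}

# Full output line: digest, a space, the mode flag, then "-" for stdin
full=$(printf '%s' "" | sha256sum)
digest=$(sha256_hex "")

printf 'full line:   |%s|\n' "$full"
printf 'digest only: |%s|\n' "$digest"
```

The same first-field extraction works unchanged when sha256sum is given a file name instead of standard input.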
Re: sha256sum --text generating blank spaces and hyphens?
On 4/26/23, David Wright wrote: > I guess you need the expense of sha256 rather than md5 as you're > downloading the entire web? I am not downloading the entire web. I have no way of knowing how they entertained those ideations but I think we could use their estimate when google said that approximately 1 million and a half books have been ever published. Think of it! It is not that much data. It would all fit nicely in one hard drive; include some searching capability and "bye bye google" will be the name of your movie. At times you need to gain a sense of things before going into exposed mode to search for something (which these days means making sure you are not being baited into something else) On 4/26/23, Dan Ritter wrote: > The only characters used in the sha256 hash itself are [a-f] and > [0-9] Yes, I knew that; that is why I could not understand why sha256sum was being "courteous" to me. On 4/26/23, Nicolas George wrote: > shaXsum always writes X/4 hexadecimal nibbles then two spaces then the > file name. If the input is from stdin, then the convention is the file > name is ‘-’. > > (Well, not always always: if the file name contains very special > characters, it will use an escaped output format. And there is the -z > option.) On 4/26/23, Thomas Schmitt wrote: > "FILE" is the minus-sign for standard input. The second blank is there > to indicate the text mode of sha256sum. > Only the first blank is somewhat puzzling. But it's always there. > > > https://www.gnu.org/software/coreutils/manual/html_node/sha2-utilities#sha2-utilities > points to > > https://www.gnu.org/software/coreutils/manual/html_node/md5sum-invocation.html > which says > For each file, ‘md5sum’ outputs by default, the MD5 checksum, a space, > a flag indicating binary or text input mode, and the file name. Binary > mode is indicated with ‘*’, text mode with ‘ ’ (space). Binary mode is > the default on systems where it’s significant, otherwise text mode is > the default. 
The cksum command always uses binary mode and a ‘ ’ > (space) flag. > > So the first blank can be relied on and thus the proposal by Andy Smith > to use "awk '{print $1}'" is valid. OK, now I see why cutting off the string on the first space that appears is safe. I never saw such cases because I always used sha*sums on files. I would expect if a user enters a string via printf that was all there was to it. Of course, sha*sums can tell apart a file from a text string. On 4/26/23, Jeffrey Walton wrote: > There's no guarantee a URL will map onto a filesystem. > I seem to > recall Stunnel tried to do that in a caching mode, but it had weird > corner cases. (In addition to problems with filesystems that had > character set and path limitations). Well, no; and I am fine with: a) trying to best match both; the URL path as best as possible b) the extra malabarism base64-ing and hashsing the name of the file ... Something I have learned as a corpora research kind of guy is not to ever try to "educate" people. I would just take their sh!t as they dump it and cleanse, deal with it! You would not hear the end of it if I start telling stories of the kind of cr@p you find out there when you look at the web from that point of view. > I think your best bet is to digest the URL into a representation. I > suggest using SipHash+Base64 or Base64URL. SipHash provides collision > resistance, a uniform distribution, and its fast. SipHash has a very > good pedigree since it was designed by Jean-Philippe Aumasson and > Daniel J. Bernstein. The final Base64 or Base64URL encoding ensures > you stay within printable character range without reserved file system > characters. Thank you I will look into what they did when I get a chance, lbrtchx On 4/26/23, Albretch Mueller wrote: > On 4/26/23, David Wright wrote: >> I guess you need the expense of sha256 rather than md5 as you're >> downloading the entire web? > > I am not downloading the entire web. 
I have no way of knowing how > they entertained those ideations but I think we could use their > estimate when they said that approximately 1 million and a half books > have been ever published. Think of it! It is not that much data. It > would all fit nicely in one hard drive include some searching > capability and "bye bye google" will be the name of your movie. > > On 4/26/23, Dan Ritter wrote: >> The only characters used in the sha256 hash itself are [a-f] and >> [0-9] > > Yes, I knew that; that is why I could not understand why sha256sum > was being "courteous" to me. > > On 4/26/23, Nicolas George wrote: >> shaXsum always writes X/4 hexadecimal nibbles then two spaces then the >> file name. If the input is from stdin, then the convention is the file >> name is ‘-’. >> >> (Well, not always always: if the file name contains very special >> characters, it will use an escaped output format. And there is the -z >> option.) > > On 4/26/23, Thomas Schmitt wrote: >> "FILE" is the minus-sign for standard
Re: sha256sum --text generating blank spaces and hyphens?
On 4/26/23, David Wright wrote: > I guess you need the expense of sha256 rather than md5 as you're > downloading the entire web? I am not downloading the entire web. I have no way of knowing how they entertained those ideations but I think we could use their estimate when they said that approximately 1 million and a half books have been ever published. Think of it! It is not that much data. It would all fit nicely in one hard drive include some searching capability and "bye bye google" will be the name of your movie. On 4/26/23, Dan Ritter wrote: > The only characters used in the sha256 hash itself are [a-f] and > [0-9] Yes, I knew that; that is why I could not understand why sha256sum was being "courteous" to me. On 4/26/23, Nicolas George wrote: > shaXsum always writes X/4 hexadecimal nibbles then two spaces then the > file name. If the input is from stdin, then the convention is the file > name is ‘-’. > > (Well, not always always: if the file name contains very special > characters, it will use an escaped output format. And there is the -z > option.) On 4/26/23, Thomas Schmitt wrote: > "FILE" is the minus-sign for standard input. The second blank is there > to indicate the text mode of sha256sum. > Only the first blank is somewhat puzzling. But it's always there. > > > https://www.gnu.org/software/coreutils/manual/html_node/sha2-utilities#sha2-utilities > points to > > https://www.gnu.org/software/coreutils/manual/html_node/md5sum-invocation.html > which says > For each file, ‘md5sum’ outputs by default, the MD5 checksum, a space, > a flag indicating binary or text input mode, and the file name. Binary > mode is indicated with ‘*’, text mode with ‘ ’ (space). Binary mode is > the default on systems where it’s significant, otherwise text mode is > the default. The cksum command always uses binary mode and a ‘ ’ > (space) flag. > > So the first blank can be relied on and thus the proposal by Andy Smith > to use "awk '{print $1}'" is valid. 
OK, now I see why cutting off the string on the first space that appears is safe. I never saw such cases because I always used sha*sums on files. I would expect if a user enters a string via printf that was all there was to it. Of course, sha*sums can tell apart a file from a string a plain text. On 4/26/23, Jeffrey Walton wrote: > There's no guarantee a URL will map onto a filesystem. > I seem to > recall Stunnel tried to do that in a caching mode, but it had weird > corner cases. (In addition to problems with filesystems that had > character set and path limitations). Well, no; and I am fine with: a) trying to best match both; the URL path as best as possible b) the extra malabarism base64-ing and hashsing the name of the file ... Something I have learned as a corpora research kind of guy is not to ever try to "educate" people. I would just take their sh!t as they dump it and cleanse, deal with it! You would not hear the end of it if I start telling stories of the kind of cr@p you find out there when you look at the web from that point of view: from folks at archive.org who would list: "Henry Valentine Miller", "Henry V. Miller", "Henry Miller", "henry miller", "Miller, Henry", "Miller, Henry 12-1891 06-1980" apparently as different authors/"creators", to the gutenberb.org large text bank including some protagonistic bs in the actual texts, to developers of libreoffice watermarking text with some cr@p which of course is being used for "monitoring" purposes by the kinds of folks who put "intelligence" in the names of the organizations they work for and to make sure they are making sense they put flags around them when they fart through their mouths whatever nonsense they think of. I had had rehearsing day dreams about becoming a dictator of the world ;-) and making people do "the right thing" (tm) ... until I had once an epiphany while watching Trump talk to a media prestitude who caracteristically wasn't making much sense. 
After asking a few questions trying to make sense of what she was saying, prestitude said "let me formulate it better". Trump quietly sat back saying: "OK, take your time"!!! I was amazed! There you have someone the U.S. media, who as a mouth piece of the status quo, were being viscerally offensive towards anything relating to him, including posting on the front page of mainstream US news papers naked pictures of his wife and mother of his child one month before she became "the first lady" and he took it easy, respectfully on her! That was the best case I have noticed so far of "separating the message from the messenger". I mean people who erect all those pay walls and somehow see themselves as authoring, guarding content are not even the messengers and we all have to put up with their bs. > I think your best bet is to digest the URL into a representation. I > suggest using SipHash+Base64 or Base64URL. SipHash provides collision > resistance, a uniform distribution, and its fast. SipHash has a very > good pe
Re: sha256sum --text generating blank spaces and hyphens?
On Wed, Apr 26, 2023 at 02:33:03PM +, Albretch Mueller wrote:

[...]

> because I would like to include the three strings in the file descriptor:
> a) the crazy long name
> b) its base64 representation
> c) §b's sha256sum representation which is the one used for the file
> name and the log of the download.

[...]

It's your work, of course.

> >> // __ $_SHA256:
> >> |7d5895cb24ab49692a8ad495e036074fec8e61b22040544f02a9b69c926dbdeb  -|
> >
> > I only see harmless hexadecimal chars there.
> >
> >> I am trying to avoid funky characters and sha256sum --text still
> >> generates them!?!
> >
> > Where are there "funky chars"?
>
> This is the first time I have seen blank spaces and hyphens in a text
> segment's sum. Those characters might be confusing.

Ah -- I think someone else (I think it was Dan, sorry if my memory fails me) pointed that out already. The dash is the "file name" (which in this case was stdin, this follows a widespread convention).

All those sums output the sum (never ever spaces in there), a whitespace, then the file name.

Background: you can give them multiple args, then they generate a list of sums and names, which you then can conveniently use with the -c option to see whether any of the files has changed.

> > Besides, I don't think --text does what you think it does. Quoting
> > the manpage:
> >
> > "Note: There is no difference between binary mode and text
> >mode on GNU systems."
>
> Thank you. I was playing with different options to see if that was
> the reason I was getting those white spaces and hyphens at the end.
>
> Why is that happening? How could it be avoided? COuld you set the
> characters used in the representation of a sum?

You just cut it out with, e.g. 'cut' like so:

    sha256sum | cut -d' ' -f1

Cheers
-- 
t
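The "list of sums and names" workflow mentioned in this message can be tried out with a short sketch like the following. The file names a.txt, b.txt and SHA256SUMS are placeholders chosen for the example.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

printf 'alpha\n' > a.txt
printf 'beta\n' > b.txt

# Several arguments produce one "digest  name" line per file.
sha256sum a.txt b.txt > SHA256SUMS

# -c re-reads the listed files and exits non-zero if any digest differs.
if sha256sum -c SHA256SUMS > /dev/null 2>&1; then before=unchanged; else before=changed; fi

printf 'edited\n' > b.txt
if sha256sum -c SHA256SUMS > /dev/null 2>&1; then after=unchanged; else after=changed; fi

echo "before edit: $before, after edit: $after"
```

This is why the file name is part of every output line: the list is meant to be fed back to the same tool.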
Re: sha256sum --text generating blank spaces and hyphens?
On Wed, Apr 26, 2023 at 3:42 AM Albretch Mueller wrote: > > This is not a debian question per se (more like a Linux bash one), > but I wasn't able to find an answer on the Internet. > > Here is first the problem I am having before you start reading a > conspiracy theory into it ;-) > > I need to somehow map URL on the web to a local file, but you can't > do that for two main reasons: > > 1) URLs are free text > 2) which people take to their heart's content. > > Take for example: > > > https://dokumen.pub/qdownload/nietzsche-und-der-deutsche-geist-band-4-ausbreitung-und-wirkung-des-nietzscheschen-werkes-im-deutschen-sprachraum-bis-zum-ende-des-zweiten-weltkrieges-ein-schrifttumsverzeichnis-der-jahre-1867-1945-ergnzungen-berichtigungen-und-gesamtverzeichnisse-zu-den-bnden-i-iii-9783110202861-9783110189865-3110189860.html > > that file and the pdf you would download I need to map to a local > directory looking like: ... /pub/dokumen/qdownload/ ... > > but the file name (excluding the extension) is 306 characters long, > which Windows NTFS would not swallow. There may be also funky rules > regarding character sets and where in a string certain chars may be > used; so, as a way to work around those kinds of problems I: > > a) encode the string name as base64 > b) calculate the sha256sum of §a > c) use §b as file name (of course, leaving the original extension as it is) > d) include a "§b_file_name.txt" plain text file decriptor which only > content is the actual prehash name of that file. 
> > > > https://dokumen.pub/qdownload/nietzsche-und-der-deutsche-geist-band-4-ausbreitung-und-wirkung-des-nietzscheschen-werkes-im-deutschen-sprachraum-bis-zum-ende-des-zweiten-weltkrieges-ein-schrifttumsverzeichnis-der-jahre-1867-1945-ergnzungen-berichtigungen-und-gesamtverzeichnisse-zu-den-bnden-i-iii-9783110202861-9783110189865-3110189860.html > > _TXT="nietzsche-und-der-deutsche-geist-band-4-ausbreitung-und-wirkung-des-nietzscheschen-werkes-im-deutschen-sprachraum-bis-zum-ende-des-zweiten-weltkrieges-ein-schrifttumsverzeichnis-der-jahre-1867-1945-ergnzungen-berichtigungen-und-gesamtverzeichnisse-zu-den-bnden-i-iii-9783110202861-9783110189865-3110189860" > _B64TXTENC=$(printf '%s' "${_TXT}" | base64 ) > echo "// __ \$_B64TXTENC: |${_B64TXTENC}|" > _B64TXTDEC=$(printf '%s' "${_B64TXTENC}" | base64 --decode) > echo "// __ \$_B64TXTDEC: |${_B64TXTDEC}|" > if [[ "${_TXT}" == "${_B64TXTDEC}" ]]; then > echo "// __ [[ \${_TXT} == \${_B64TXTDEC} ]]: |${_TXT}|" > _SHA256=$(printf '%s' "${_TXT}" | sha256sum --text ) > echo "// __ \$_SHA256: |${_SHA256}|" > fi > > // __ $_SHA256: > |7d5895cb24ab49692a8ad495e036074fec8e61b22040544f02a9b69c926dbdeb -| > > I am trying to avoid funky characters and sha256sum --text still > generates them!?! > > I work like this because I need replicate the original URL as a local > path in a way that would be compatible any file system. > > Do you know of a better way to deal with such issues? There's no guarantee a URL will map onto a filesystem. I seem to recall Stunnel tried to do that in a caching mode, but it had weird corner cases. (In addition to problems with filesystems that had character set and path limitations). I think your best bet is to digest the URL into a representation. I suggest using SipHash+Base64 or Base64URL. SipHash provides collision resistance, a uniform distribution, and it's fast. SipHash has a very good pedigree since it was designed by Jean-Philippe Aumasson and Daniel J. Bernstein. 
The final Base64 or Base64URL encoding ensures you stay within printable character range without reserved file system characters. Jeff
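A small sketch of only the final encoding step, under stated assumptions: it does not compute SipHash (no SipHash call is used here), and the helper name b64url is made up for illustration. It shows why plain Base64 can still yield a character that is reserved in paths, and how the URL-safe alphabet avoids it.

```bash
# Hypothetical helper: Base64URL-encode stdin (unwrapped, padding removed).
b64url() {
    base64 | tr -d '\n' | tr '+/' '-_' | tr -d '='
}

# Two bytes chosen so that plain Base64 produces both '+' and '/'.
plain=$(printf '\373\377' | base64)
safe=$(printf '\373\377' | b64url)

echo "$plain"   # +/8=
echo "$safe"    # -_8
```

A hexadecimal digest as printed by sha256sum is already limited to 0-9 and a-f, so this step would mainly matter when the digest is handled as raw bytes.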
Re: sha256sum --text generating blank spaces and hyphens?
Albretch Mueller wrote: > On 4/26/23, Andy Smith wrote: > > If you're referring to the space and then the file name ("-" in case > > of stdin) on the end, you can just select only the first output up > > to whitespace with e.g. awk: > > > > _SHA256=$(printf '%s' "${_TXT}" | sha256sum | awk '{print $1}') > > Yes, you could but I am trying to find out why this is happening > instead of truncating the string when a space appears because I don't > think what would be safe. The white space and the - are not part of the sha256, they are emitted by sha256sum as a courtesy. You can safely remove everything starting with the first whitespace. > >> // __ $_SHA256: > >> |7d5895cb24ab49692a8ad495e036074fec8e61b22040544f02a9b69c926dbdeb -| > > > > I only see harmless hexadecimal chars there. > > > >> I am trying to avoid funky characters and sha256sum --text still > >> generates them!?! > > > > Where are there "funky chars"? > > This is the first time I have seen blank spaces and hyphens in a text > segment's sum. Those characters might be confusing. The white space and the - are not part of the sha256, they are emitted by sha256sum as a courtesy. You can safely remove everything starting with the first whitespace. > Why is that happening? How could it be avoided? COuld you set the > characters used in the representation of a sum? The white space and the - are not part of the sha256, they are emitted by sha256sum as a courtesy. You can safely remove everything starting with the first whitespace. The only characters used in the sha256 hash itself are [a-f] and [0-9] -dsr-
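If trimming at the first whitespace still feels unsafe, one possible safeguard is to check the result afterwards. This is a minimal sketch assuming GNU sha256sum; the variable names and the sample input are illustrative only.

```bash
sum=$(printf '%s' 'any text' | sha256sum | awk '{print $1}')

# A SHA-256 sum printed by sha256sum is 64 lowercase hexadecimal digits.
if printf '%s\n' "$sum" | grep -Eq '^[0-9a-f]{64}$'; then
    verdict=clean
else
    verdict=unexpected
fi
echo "$verdict"
```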
Re: sha256sum --text generating blank spaces and hyphens?
Hi, Albretch Mueller wrote: > > > I am trying to avoid funky characters and sha256sum --text still > > > generates them!?! Andy Smith wrote: > > If you're referring to the space and then the file name ("-" in case > > of stdin) on the end, you can just select only the first output up > > to whitespace with e.g. awk: > Yes, you could but I am trying to find out why this is happening > instead of truncating the string when a space appears because I don't > think what would be safe. One of the blanks and the hyphen-or-minus are announced by the man page: The default mode is to print a line with checksum, a character indicating input mode ('*' for binary, space for text), and name for each FILE. "FILE" is the minus-sign for standard input. The second blank is there to indicate the text mode of sha256sum. Only the first blank is somewhat puzzling. But it's always there. https://www.gnu.org/software/coreutils/manual/html_node/sha2-utilities#sha2-utilities points to https://www.gnu.org/software/coreutils/manual/html_node/md5sum-invocation.html which says For each file, ‘md5sum’ outputs by default, the MD5 checksum, a space, a flag indicating binary or text input mode, and the file name. Binary mode is indicated with ‘*’, text mode with ‘ ’ (space). Binary mode is the default on systems where it’s significant, otherwise text mode is the default. The cksum command always uses binary mode and a ‘ ’ (space) flag. So the first blank can be relied on and thus the proposal by Andy Smith to use "awk '{print $1}'" is valid. Have a nice day :) Thomas
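The line layout quoted from the manual can be inspected by character position. The sketch below assumes GNU sha256sum and hashes empty input from stdin; the variable names are made up for illustration.

```bash
text_line=$(printf '' | sha256sum --text)
binary_line=$(printf '' | sha256sum --binary)

# Characters 1-64 are the sum, 65 is the separating blank,
# 66 is the mode flag and the name starts at 67.
text_flag=$(printf '%s' "$text_line" | cut -c 66)
binary_flag=$(printf '%s' "$binary_line" | cut -c 66)
name=$(printf '%s' "$text_line" | cut -c 67-)

echo "text flag:   |$text_flag|"
echo "binary flag: |$binary_flag|"
echo "name:        |$name|"
```

Because the sum itself never contains a blank, splitting at the first one gives the same result in either mode.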
Re: sha256sum --text generating blank spaces and hyphens?
On Wed 26 Apr 2023 at 14:33:03 (+), Albretch Mueller wrote: > On 4/26/23, to...@tuxteam.de wrote: > >> a) encode the string name as base64 > >> b) calculate the sha256sum of §a > > > > Why the detour over base64? > > because I would like to include the three strings in the file descriptor: > a) the crazy long name > b) its base64 representation The base64 command wraps the output, in case you didn't notice. > c) §b's sha256sum representation which is the one used for the file > name and the log of the download. I guess you need the expense of sha256 rather than md5 as you're downloading the entire web? > I would like to make this scheme "fool (and fail) proof" as they say. > There is no way in earth that a file system messes with all three > aspects of it. > > >> c) use §b as file name (of course, leaving the original extension as it > >> is) > > > > Why the extension? DOS nostalgia? > > The local copies should represent the web URLs as close as possible > in order to minimize "what came from where" kinds of confusions. Also > from the same URL you would then download the corresponding pdf file > with exactly the same name, the only difference being the extension. The extension is part of the name. If you preserve it as is, what happens when it contains a "funky" character? > >> // __ $_SHA256: > >> |7d5895cb24ab49692a8ad495e036074fec8e61b22040544f02a9b69c926dbdeb -| > > > > > > I only see harmless hexadecimal chars there. > > > >> I am trying to avoid funky characters and sha256sum --text still > >> generates them!?! > > > > Where are there "funky chars"? > > This is the first time I have seen blank spaces and hyphens in a text > segment's sum. Those characters might be confusing. You calculated the sha256sum of stdin. - is the name of the input file you encoded. Duh. Cheers, David.
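The wrapping remark matters for the scheme because a sum taken over the Base64 text depends on whether that text contains line breaks. A small sketch, assuming GNU base64 (the -w option is GNU-specific) and a made-up 100-byte input:

```bash
# 100 input bytes encode to 136 Base64 characters, more than one 76-column line.
input=$(printf '%0100d' 0)

wrapped=$(printf '%s' "$input" | base64)
unwrapped=$(printf '%s' "$input" | base64 -w 0)

printf '%s\n' "$wrapped" | wc -l     # 2
printf '%s\n' "$unwrapped" | wc -l   # 1

# The embedded newline changes the bytes being hashed, hence the sum.
sum_wrapped=$(printf '%s' "$wrapped" | sha256sum | cut -d' ' -f1)
sum_unwrapped=$(printf '%s' "$unwrapped" | sha256sum | cut -d' ' -f1)
[ "$sum_wrapped" != "$sum_unwrapped" ] && echo "sums differ"
```

Picking one form (for example always unwrapped) and recording that choice would keep the sums reproducible.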
Re: sha256sum --text generating blank spaces and hyphens?
Albretch Mueller (12023-04-26): > Yes, you could but I am trying to find out why this is happening > instead of truncating the string when a space appears because I don't > think what would be safe. shaXsum always writes X/4 hexadecimal nibbles then two spaces then the file name. If the input is from stdin, then the convention is the file name is ‘-’. (Well, not always always: if the file name contains very special characters, it will use an escaped output format. And there is the -z option.) For your case, just use “cut -c 1-64”. > > Why the detour over base64? > because I would like to include the three strings in the file descriptor: > a) the crazy long name > b) its base64 representation > c) §b's sha256sum representation which is the one used for the file > name and the log of the download. Then do so, but in c, store the SHA-256 of the URL, not the SHA-256 of the base64 encoding of the URL. > The local copies should represent the web URLs as close as possible > in order to minimize "what came from where" kinds of confusions. You are right to do so. Many utilities rely on the extension to decide what to do with a file. Lacking a standardized place to store the file type, it is the most robust option. Applications that rely on probing and heuristics, like libfile and co., are in fact much less reliable and a lot more annoying. (Also, if we were to want a standardized place to store the file type, a lot of user interface would have to be revamped.) OTOH, HTTP does have a place to state the type of the file, and the extension in URLs is not reliable: if you want to do it properly, you must set your local file extension based on the Content-Type response header. > Also > from the same URL you would then download the corresponding pdf file > with exactly the same name, the only difference being the extension. Then you need to exclude the extension from the URL, but a lot of URLs do not have extensions and you should be using the Content-Type instead. 
This feature is a pipe dream. > This is the first time I have seen blank spaces and hyphens in a text > segment's sum. Those characters might be confusing. See above. Regards, -- Nicolas George
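The escaped output format mentioned in this reply only affects named files, but it shows why taking the first 64 characters is tied to hashing stdin. A sketch assuming GNU sha256sum; the file name odd\name.txt and the scratch directory are made up for illustration.

```bash
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# A file whose name contains a backslash triggers the escaped format.
printf 'data' > 'odd\name.txt'
line=$(sha256sum 'odd\name.txt')
printf '%s\n' "$line"

# The line now starts with a backslash, so characters 1-64 are not the sum.
naive=$(printf '%s\n' "$line" | cut -c 1-64)

# Drop an optional leading backslash first, then take 64 characters.
robust=$(printf '%s\n' "$line" | sed 's/^\\//' | cut -c 1-64)

# Hashing stdin does not need this, because the name is always "-".
from_stdin=$(printf 'data' | sha256sum | cut -c 1-64)
if [ "$robust" = "$from_stdin" ]; then
    echo "sums match"
fi

cd - >/dev/null || exit 1
rm -rf "$workdir"
```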
Re: sha256sum --text generating blank spaces and hyphens?
On 4/26/23, Andy Smith wrote: > If you're referring to the space and then the file name ("-" in case > of stdin) on the end, you can just select only the first output up > to whitespace with e.g. awk: > > _SHA256=$(printf '%s' "${_TXT}" | sha256sum | awk '{print $1}') Yes, you could, but I am trying to find out why this is happening instead of truncating the string when a space appears, because I don't think that would be safe. > These web sites can change their URLs at any time you know, so it > may not be worth trying to replicate their structure locally. Yes, I know, and my way to deal with such issues is: a) by including in the name of the web log of the download the date and time ... b) once the data file is downloaded, say a pdf file of an old book or some publication, all the metadata in the front and back pages of the book are OCRed, the actual title, ISBN, publishing date ... On 4/26/23, to...@tuxteam.de wrote: >> a) encode the string name as base64 >> b) calculate the sha256sum of §a > > Why the detour over base64? Because I would like to include the three strings in the file descriptor: a) the crazy long name b) its base64 representation c) §b's sha256sum representation which is the one used for the file name and the log of the download. I would like to make this scheme "fool (and fail) proof" as they say. There is no way on earth that a file system messes with all three aspects of it. >> c) use §b as file name (of course, leaving the original extension as it >> is) > > Why the extension? DOS nostalgia? The local copies should represent the web URLs as closely as possible in order to minimize "what came from where" kinds of confusions. Also from the same URL you would then download the corresponding pdf file with exactly the same name, the only difference being the extension. >> // __ $_SHA256: >> |7d5895cb24ab49692a8ad495e036074fec8e61b22040544f02a9b69c926dbdeb -| > > > I only see harmless hexadecimal chars there. 
> >> I am trying to avoid funky characters and sha256sum --text still >> generates them!?! > > Where are there "funky chars"? This is the first time I have seen blank spaces and hyphens in a text segment's sum. Those characters might be confusing. > Besides, I don't think --text does what you think it does. Quoting > the manpage: > > "Note: There is no difference between binary mode and text >mode on GNU systems." Thank you. I was playing with different options to see if that was the reason I was getting those white spaces and hyphens at the end. Why is that happening? How could it be avoided? Could you set the characters used in the representation of a sum? lbrtchx
Re: sha256sum --text generating blank spaces and hyphens?
Hello, On Wed, Apr 26, 2023 at 07:41:56AM +, Albretch Mueller wrote: > _SHA256=$(printf '%s' "${_TXT}" | sha256sum --text ) > echo "// __ \$_SHA256: |${_SHA256}|" […] > // __ $_SHA256: > |7d5895cb24ab49692a8ad495e036074fec8e61b22040544f02a9b69c926dbdeb -| > > I am trying to avoid funky characters and sha256sum --text still > generates them!?! If you're referring to the space and then the file name ("-" in case of stdin) on the end, you can just select only the first output up to whitespace with e.g. awk: _SHA256=$(printf '%s' "${_TXT}" | sha256sum | awk '{print $1}') Your use of "--text" does nothing by the way. > I work like this because I need replicate the original URL as a local > path in a way that would be compatible any file system. These web sites can change their URLs at any time you know, so it may not be worth trying to replicate their structure locally. Also maybe you want some sort of web site mirroring solution. Cheers, Andy -- https://bitfolk.com/ -- No-nonsense VPS hosting
Re: sha256sum --text generating blank spaces and hyphens?
On Wed, Apr 26, 2023 at 07:41:56AM +, Albretch Mueller wrote: > This is not a debian question per se (more like a Linux bash one), > but I wasn't able to find an answer on the Internet. > > Here is first the problem I am having before you start reading a > conspiracy theory into it ;-) > > I need to somehow map URL on the web to a local file, but you can't > do that for two main reasons: OK. [...] > but the file name (excluding the extension) is 306 characters long, > which Windows NTFS [...] There's the first problem. > a) encode the string name as base64 > b) calculate the sha256sum of §a Why the detour over base64? > c) use §b as file name (of course, leaving the original extension as it is) Why the extension? DOS nostalgia? > d) include a "§b_file_name.txt" plain text file decriptor which only > content is the actual prehash name of that file. OK. [...] > // __ $_SHA256: > |7d5895cb24ab49692a8ad495e036074fec8e61b22040544f02a9b69c926dbdeb -| I only see harmless hexadecimal chars there. > I am trying to avoid funky characters and sha256sum --text still > generates them!?! Where are there "funky chars"? > I work like this because I need replicate the original URL as a local > path in a way that would be compatible any file system. > > Do you know of a better way to deal with such issues? Besides, I don't think --text does what you think it does. Quoting the manpage: "Note: There is no difference between binary mode and text mode on GNU systems." This is about *reading* the input in text or binary mode, which are equivalent in most civilised operating systems. Cheers -- t
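The manpage note can be checked directly. This is a small sketch that assumes a GNU system; the sample input and variable names are made up.

```bash
with_text=$(printf '%s' 'same input' | sha256sum --text)
default_mode=$(printf '%s' 'same input' | sha256sum)

# On a GNU system both invocations are expected to print the same line.
if [ "$with_text" = "$default_mode" ]; then
    echo "identical output"
fi
```

So the trailing blanks and hyphen do not come from --text; they are part of the normal output line.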
sha256sum --text generating blank spaces and hyphens?
This is not a debian question per se (more like a Linux bash one), but I wasn't able to find an answer on the Internet.

Here is first the problem I am having before you start reading a conspiracy theory into it ;-)

I need to somehow map URLs on the web to local files, but you can't do that for two main reasons:

1) URLs are free text
2) which people take to their heart's content.

Take for example:

https://dokumen.pub/qdownload/nietzsche-und-der-deutsche-geist-band-4-ausbreitung-und-wirkung-des-nietzscheschen-werkes-im-deutschen-sprachraum-bis-zum-ende-des-zweiten-weltkrieges-ein-schrifttumsverzeichnis-der-jahre-1867-1945-ergnzungen-berichtigungen-und-gesamtverzeichnisse-zu-den-bnden-i-iii-9783110202861-9783110189865-3110189860.html

that file and the pdf you would download I need to map to a local directory looking like:

... /pub/dokumen/qdownload/ ...

but the file name (excluding the extension) is 306 characters long, which Windows NTFS would not swallow. There may also be funky rules regarding character sets and where in a string certain chars may be used; so, as a way to work around those kinds of problems I:

a) encode the string name as base64
b) calculate the sha256sum of §a
c) use §b as file name (of course, leaving the original extension as it is)
d) include a "§b_file_name.txt" plain text file descriptor whose only content is the actual prehash name of that file.

https://dokumen.pub/qdownload/nietzsche-und-der-deutsche-geist-band-4-ausbreitung-und-wirkung-des-nietzscheschen-werkes-im-deutschen-sprachraum-bis-zum-ende-des-zweiten-weltkrieges-ein-schrifttumsverzeichnis-der-jahre-1867-1945-ergnzungen-berichtigungen-und-gesamtverzeichnisse-zu-den-bnden-i-iii-9783110202861-9783110189865-3110189860.html

_TXT="nietzsche-und-der-deutsche-geist-band-4-ausbreitung-und-wirkung-des-nietzscheschen-werkes-im-deutschen-sprachraum-bis-zum-ende-des-zweiten-weltkrieges-ein-schrifttumsverzeichnis-der-jahre-1867-1945-ergnzungen-berichtigungen-und-gesamtverzeichnisse-zu-den-bnden-i-iii-9783110202861-9783110189865-3110189860"
_B64TXTENC=$(printf '%s' "${_TXT}" | base64 )
echo "// __ \$_B64TXTENC: |${_B64TXTENC}|"
_B64TXTDEC=$(printf '%s' "${_B64TXTENC}" | base64 --decode)
echo "// __ \$_B64TXTDEC: |${_B64TXTDEC}|"
if [[ "${_TXT}" == "${_B64TXTDEC}" ]]; then
  echo "// __ [[ \${_TXT} == \${_B64TXTDEC} ]]: |${_TXT}|"
  _SHA256=$(printf '%s' "${_TXT}" | sha256sum --text )
  echo "// __ \$_SHA256: |${_SHA256}|"
fi

// __ $_SHA256: |7d5895cb24ab49692a8ad495e036074fec8e61b22040544f02a9b69c926dbdeb -|

I am trying to avoid funky characters and sha256sum --text still generates them!?!

I work like this because I need to replicate the original URL as a local path in a way that would be compatible with any file system.

Do you know of a better way to deal with such issues?

lbrtchx
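Steps a) to d) could be sketched roughly as below, assuming GNU coreutils. The helper name make_descriptor, the scratch directory and the shortened sample name are made up for illustration. The sketch follows the steps as written, so it hashes the Base64 text, whereas the script above hashes ${_TXT} directly.

```bash
# Hypothetical helper: write the descriptor file for a name and
# print the hash-based stem to use for the downloaded file.
make_descriptor() {
    name=$1   # original file name without extension
    dir=$2    # target directory

    # step a: Base64 of the name, with line wrapping removed
    b64=$(printf '%s' "$name" | base64 | tr -d '\n')

    # step b: sum of the Base64 text, keeping only the 64 hex digits
    sum=$(printf '%s' "$b64" | sha256sum | cut -d' ' -f1)

    # step d: descriptor whose only content is the pre-hash name
    printf '%s\n' "$name" > "$dir/${sum}_file_name.txt"

    # step c: the caller appends the original extension to this stem
    printf '%s\n' "$sum"
}

target=$(mktemp -d)
stem=$(make_descriptor 'nietzsche-und-der-deutsche-geist-band-4' "$target")
echo "would save as: $target/$stem.pdf"
cat "$target/${stem}_file_name.txt"
```

Cutting the sum at the first blank inside the helper keeps the trailing " -" out of the file name.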