Re: sha256sum --text generating blank spaces and hyphens?

2023-05-01 Thread Albretch Mueller
On 4/27/23, David Christensen  wrote:
> Please see the OP, step (d).

>On 4/26/23, Albretch Mueller  wrote:
>>  a) encode the string name as base64
>> b) calculate the sha256sum of §a
>>  c) use §b as file name (of course, leaving the original extension as it
>> is)
>>  d) include a "§b_file_name.txt" plain text file descriptor whose only
>> content is the actual prehash name of that file.

 I do that because base64 would (must?) work on any OS and the
conversion from and to any other encoding is straightforward. As you
suggested, I am more friendly to the idea of including hashes of the
data payload, even though I think it is not that important, because
the actual big problem that corpora research people have is files with
exactly the same look and feel and the same content which have
different hashes (for example, PDF files). I have been thinking about
a way to compute hashes that reflect more faithfully both the
structural and the content similarity among files. Do you know of any way
to do such a thing? The structural aspect should be "easy". It could be
handled as DAGs of some sort of XPaths.

  I was actually going to show to you what I meant, but I was happy to
see "I was wrong". I even waited to try it from some other access
point. I have used this one liner to show how
google/youtube/NSA/"Vladimir Putin"/... was watermarking files for
whatever reason, but it worked fine when I was trying to show it to
you ;-)

_YT_URI=EngW7tLk6R8; _OFL="${_YT_URI}_"$(date +%Y%m%d%H%M%S)".mp4";
./yt-dlp --verbose --format "mp4" --output "${_OFL}" -- "${_YT_URI}";
ls -l "${_OFL}"; file --brief "${_OFL}"; time sha256sum "${_OFL}"


-rwxrwxrwx 1 user user 828540 Aug 15  2022 EngW7tLk6R8_20230501185618.mp4
ISO Media, MP4 v2 [ISO 14496-14]
0b950b88667b5fec35f3dd54005c16e5e742c703a0c776ec6da11b60a4775ae6
EngW7tLk6R8_20230501185618.mp4

-rwxrwxrwx 1 user user 828540 Aug 15  2022 EngW7tLk6R8_20230501185657.mp4
ISO Media, MP4 v2 [ISO 14496-14]
0b950b88667b5fec35f3dd54005c16e5e742c703a0c776ec6da11b60a4775ae6
EngW7tLk6R8_20230501185657.mp4

 Max Nikulin (12023-04-28):
> And you will quickly face servers that send an incorrect Content-Type or
> intentionally send application/octet-stream with a nosniff header to force
> the browser to save the file instead of opening it, e.g. in a built-in PDF
> reader.

 Even if not totally syntactic (so you can't functionally solve it
with some code), this is a relatively manageable problem, you would:

 a) take notice of the sites that do such things;
 b) sniff not only the http headers, but notice the file extension of
the file; and
 c) save the file to a temp repository for the Linux util "file" to be
run on it ...

 Out of those heuristics you should be able to strategize around such problems.

 lbrtchx



Re: sha256sum --text generating blank spaces and hyphens?

2023-04-29 Thread Nicolas George
Max Nikulin (12023-04-29):
> > incorrect
> This word was stripped in the following quote as well.

I was being charitable in not pointing out the logical contradiction that if
it is intentional then it is not incorrect, at least for somebody.

> Writing the cited phrase I had in mind an attack which target

You can send your movie-plot attacks to Bruce Schneier's next
competition; as for me, I will not answer further.

-- 
  Nicolas George




Re: sha256sum --text generating blank spaces and hyphens?

2023-04-29 Thread Max Nikulin

On 28/04/2023 23:42, Max Nikulin wrote:
> incorrect

This word was stripped in the following quote as well.

On 29/04/2023 15:50, Nicolas George wrote:
> Max Nikulin (12023-04-28):
>>   value may be intentionally specified
>
> I am stripping your mail to just these few words, because they are the
> core flaw of your argument.

If you prefer to ignore the other arguments, I leave that up to you.
The source of Content-Type HTTP header values may be a simple file suffix
map like

types {
    text/html  html;
    image/gif  gif;
    image/jpeg jpg;
}
http://nginx.org/en/docs/http/ngx_http_core_module.html#types

> If something has been done intentionally, overriding it with an
> heuristic is a very bad practice.

Writing the cited phrase, I had in mind an attack whose aim is to pass
an innocent-looking file name to a specific application usually used for
another purpose.

> As for invalid values that are mistakenly specified, they are a
> minority, and basing your entire design on a minority of mistakes is
> also not a very good practice.

I consider it important to notify the user that something might have gone
wrong and that inconsistent data may have been received. Even if it is a
rare case, it should help to take appropriate action, to correct a
mistake, and to minimize damage.





Re: sha256sum --text generating blank spaces and hyphens?

2023-04-29 Thread Nicolas George
Max Nikulin (12023-04-28):
> value may be intentionally specified

I am stripping your mail to just these few words, because they are the
core flaw of your argument.

If something has been done intentionally, overriding it with an
heuristic is a very bad practice.

As for invalid values that are mistakenly specified, they are a
minority, and basing your entire design on a minority of mistakes is
also not a very good practice.

-- 
  Nicolas George




Re: sha256sum --text generating blank spaces and hyphens?

2023-04-28 Thread Max Nikulin

On 28/04/2023 15:06, Nicolas George wrote:
> Max Nikulin (12023-04-28):
>> So URI comparison is not a trivial task.
>
> It is an impossible task unless you have specific information about the
> workings of the website.

However some steps toward URL normalization should still be tried.

>> And you will quickly face servers that send an incorrect Content-Type or
>> intentionally send application/octet-stream with a nosniff header to force
>> the browser to save the file instead of opening it, e.g. in a built-in PDF
>> reader.
>
> So what?

Usually I would trust libmagic/file(1) more than the Content-Type
header. An HTTP server may choose the header based on the file extension.
Of course, there are cases where the info provided by libmagic may be
extended by Content-Type or by the file suffix (in the URI path or in the
download file name hint in HTTP headers): XPI browser extensions are ZIP
files. A plain text file may contain Markdown or reStructuredText markup.
You regret the absence of a standard way to store the file type, but an
incorrect value may be intentionally specified there. I consider
heuristics unavoidable, whether with a standardized place or without it.
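
A minimal sketch of the cross-check discussed here: compare the media type a server declared with what file(1) detects from the content. It assumes file(1) is installed and that the declared Content-Type value has already been extracted from the response. The function names are made up for illustration and are not part of any tool mentioned in this thread.

```sh
#!/bin/sh
# Hypothetical helpers for illustration only.

# Reduce a Content-Type header value to its bare media type:
# drop parameters such as "; charset=UTF-8", remove blanks, lowercase.
content_type_base() {
    printf '%s\n' "$1" | sed 's/;.*$//' | tr -d ' \t' | tr '[:upper:]' '[:lower:]'
}

# Compare the declared type with what file(1) detects from the content.
# Usage: check_declared_type FILE DECLARED_CONTENT_TYPE
# Exit status: 0 = match, 1 = mismatch, 2 = unreadable file or file(1) failed.
check_declared_type() (
    [ -r "$1" ] || exit 2
    declared=$(content_type_base "$2")
    detected=$(file --brief --mime-type -- "$1") || exit 2
    if [ "$declared" = "$detected" ]; then
        printf 'match: %s\n' "$detected"
    else
        printf 'mismatch: declared=%s detected=%s\n' "$declared" "$detected"
        exit 1
    fi
)
```

A mismatch is only a hint for the user, not a verdict: as noted above, an XPI is legitimately detected as a ZIP archive, and application/octet-stream may have been sent on purpose.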





Re: sha256sum --text generating blank spaces and hyphens?

2023-04-28 Thread Nicolas George
Max Nikulin (12023-04-28):
> So URI comparison is not a trivial task.

It is an impossible task unless you have specific information about the
workings of the website.

> And you will quickly face servers that send an incorrect Content-Type or
> intentionally send application/octet-stream with a nosniff header to force
> the browser to save the file instead of opening it, e.g. in a built-in PDF
> reader.

So what?

-- 
  Nicolas George




Re: sha256sum --text generating blank spaces and hyphens?

2023-04-27 Thread Max Nikulin

On 26/04/2023 21:33, Albretch Mueller wrote:
>  a) the crazy long name
>  b) its base64 representation
>  c) §b's sha256sum representation which is the one used for the file
> name and the log of the download.

I see no point in the base64 step since the SHA may be calculated for the
original URI directly. However, an important step is missing: URI
normalization.

- Often http: and https: are alternatives.
- The domain name may contain Unicode characters or be represented as pure
  ASCII Punycode.
- #anchors (sometimes an empty #) at the end of a URI usually do not change
  the served content. They may however be abused by some web applications
  to provide content that depends on the anchor. Or a web page may hide
  parts of its content using CSS depending on the anchor. So stripping
  them may cause trouble.
- Session or user activity tracking query ("search") parameters must be
  stripped for archival purposes.
- Some parts of a URI may be percent-encoded while remaining equivalent to
  the "canonical" URI.
- A web page may suggest a "canonical" URL, but sometimes it is a
  misleading hint.

So URI comparison is not a trivial task.

Another point is that the same page may be saved multiple times, so a URI
hash is not enough as a unique key.
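
A rough sketch of the two cheapest normalization steps from the list above: dropping the #fragment and lowercasing the scheme and host. The function name is hypothetical. It assumes an absolute URL without a user:password@ part, and it deliberately does nothing about Punycode, percent-encoding, tracking parameters or http/https equivalence.

```sh
#!/bin/sh
# normalize_url URL
# Hypothetical helper: prints URL without its fragment and with the
# scheme and authority (host[:port]) lowercased. Path and query are kept
# byte for byte, since their case can be significant.
normalize_url() (
    url=${1%%"#"*}                      # drop "#fragment", if any
    case $url in
        *://*) ;;
        *) printf '%s\n' "$url"; exit 0 ;;   # not absolute: leave as is
    esac
    scheme=${url%%://*}
    rest=${url#*://}
    authority=${rest%%[/?]*}            # everything before the first / or ?
    tail=${rest#"$authority"}           # path and query, unchanged
    scheme=$(printf '%s' "$scheme" | tr '[:upper:]' '[:lower:]')
    authority=$(printf '%s' "$authority" | tr '[:upper:]' '[:lower:]')
    printf '%s://%s%s\n' "$scheme" "$authority" "$tail"
)
```

For example, `normalize_url 'HTTPS://Example.COM/a/B?x=1#top'` prints `https://example.com/a/B?x=1`. Because stripping the fragment can change what some sites serve, it is probably wise to store the original URL next to the normalized one.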


On 26/04/2023 21:48, Nicolas George wrote:
> OTOH, HTTP does have a place to state the type of the file, and the
> extension in URLs is not reliable: if you want to do it properly, you
> must set your local file extension based on the Content-Type response
> header.

And you will quickly face servers that send an incorrect Content-Type or
intentionally send application/octet-stream with a nosniff header to force
the browser to save the file instead of opening it, e.g. in a built-in PDF
reader.




Re: sha256sum --text generating blank spaces and hyphens?

2023-04-27 Thread David Christensen

On 4/27/23 01:04, Nicolas George wrote:
> David Christensen (12023-04-26):
>> My suggestion assumes that the URL => hash => content mapping is saved
>> somehow.
>
> That is an assumption that needed to be made explicit from the start.
>
>>  For example, save the content in a file named after the hash and
>> save the URL in a file whose name is the hash plus a suffix. Finding a
>> document by URL then becomes a grep(1) invocation.
>
> This is not very efficient.

Please see the OP, step (d).

You are free to propose better solutions.


On 4/26/23 21:02, David Christensen wrote:

> Things get more interesting when you approach the problem as a database.
>   Save the content wherever and put the metadata into a table -- content
> hash (primary key), URL, download timestamp, author, subject, title,
> keywords, etc..  Create fully inverted indexes.  Create a search engine.
>   Create a spider.  Implementation could range from a CSV/TSV flat-file
> and shell/P* scripts, to a desktop database/UI, to a LAMP stack, and
> beyond (NoSQL, N-tier).  There are distributed file sharing systems
> based on such ideas.


David



Re: sha256sum --text generating blank spaces and hyphens?

2023-04-27 Thread Nicolas George
David Christensen (12023-04-26):
> My suggestion assumes that the URL => hash => content mapping is saved
> somehow.

That is an assumption that needed to be made explicit from the start.

>  For example, save the content in a file named after the hash and
> save the URL in a file whose name is the hash plus a suffix. Finding a
> document by URL then becomes a grep(1) invocation.

This is not very efficient.

-- 
  Nicolas George



Re: sha256sum --text generating blank spaces and hyphens?

2023-04-27 Thread Albretch Mueller
On 4/27/23, Max Nikulin  wrote:
> I have never tried: "Open-source self-hosted web archiving"
> https://github.com/ArchiveBox/ArchiveBox
>
> This one allows to save selected part of a page:
> https://github.com/danny0838/webscrapbook/

 Thank you for keeping me busy! From their recommendations:

 https://github.com/ArchiveBox/ArchiveBox/wiki/Web-Archiving-Community

 https://en.wikipedia.org/wiki/List_of_Web_archiving_initiatives

 https://coptr.digipres.org/index.php/Main_Page
~
 However, what I have in mind is definitely more than archiving, which
would be only the first phase of it.

// __ [Corpora-List] towards a "pan document format" (pun intended) . . .

 
https://list.elra.info/mailman3/hyperkitty/list/corp...@list.elra.info/message/4AULI3UUQ7BQG5ANFYGEEL7FXQXIILYN/
~
 In particular, I am interested in a corpus of "universally appealing writers"

// __ list of authors and their work ...

 
https://list.elra.info/mailman3/hyperkitty/list/corp...@list.elra.info/thread/5PFZUBNLRWW2FDHDWHPKZYOMAGZLOWXG/#4BTSFS5OCUFWVWU4ZDSBJ765DQFWWI7B/
~
 lbrtchx



Re: sha256sum --text generating blank spaces and hyphens?

2023-04-26 Thread Max Nikulin

On 27/04/2023 11:02, David Christensen wrote:
> Things get more interesting when you approach the problem as a database.
>   Save the content wherever and put the metadata into a table -- content
> hash (primary key), URL, download timestamp, author, subject, title,
> keywords, etc..  Create fully inverted indexes.  Create a search engine.
>   Create a spider.  Implementation could range from a CSV/TSV flat-file
> and shell/P* scripts, to a desktop database/UI, to a LAMP stack, and
> beyond (NoSQL, N-tier).  There are distributed file sharing systems
> based on such ideas.


I have never tried: "Open-source self-hosted web archiving"
https://github.com/ArchiveBox/ArchiveBox

This one allows saving a selected part of a page:
https://github.com/danny0838/webscrapbook/



Re: sha256sum --text generating blank spaces and hyphens?

2023-04-26 Thread David Christensen

On 4/26/23 16:21, Albretch Mueller wrote:
> On 4/26/23, David Christensen  wrote:
>> I suggest hashing the document content rather than the URL.  This would
>> work nicely for static documents.
>
>  What do you mean by "hashing the document content"?

2023-04-26 21:03:08 dpchrist@taz ~
$ touch foo

2023-04-26 21:03:12 dpchrist@taz ~
$ sha256sum foo
e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855  foo

In this case, the content is an empty string and the hexadecimal
encoding of the SHA256 hash is
"e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855".

>  How would that help when what you are trying to do is cleanse and
> canonize texts as best as you could to find relationships among their
> text segments?
>
>  lbrtchx

* Each unique text would be stored once regardless of how many URLs
link to it.

* If the content at a URL changes, the new content will have a new hash.
 So, the new content will be saved and the old content will be
preserved (instead of the new content overwriting the old content).

* With regard to my response to the post by Nicolas George, a database
of metadata could benefit analysis regardless of the scheme used to name
content files.



David




Re: sha256sum --text generating blank spaces and hyphens?

2023-04-26 Thread David Christensen

On 4/26/23 15:48, Nicolas George wrote:
> David Christensen (12023-04-26):
>> I suggest hashing the document content rather than the URL.  This would work
>> nicely for static documents.
>
> That will be very convenient to retrieve the document content from the
> URL.

My suggestion assumes that the URL => hash => content mapping is saved
somehow.  For example, save the content in a file named after the hash
and save the URL in a file whose name is the hash plus a suffix.
Finding a document by URL then becomes a grep(1) invocation.
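
The scheme just described could be sketched roughly as follows. This is an illustration under assumptions, not a finished tool: the directory layout, the ".url" suffix and the function names are all made up here, and GNU sha256sum is assumed to be available.

```sh
#!/bin/sh
# store_doc STORE_DIR URL FILE
# Hypothetical helper: copies FILE into STORE_DIR under the SHA-256 of its
# content and appends URL to "<hash>.url". Prints the hash.
store_doc() (
    store=$1 url=$2 file=$3
    # Reading via stdin keeps the file name out of the sha256sum output.
    hash=$(sha256sum < "$file" | cut -d' ' -f1)
    [ -n "$hash" ] || exit 1
    mkdir -p "$store" || exit 1
    # Identical content is stored only once, however many URLs point to it.
    [ -e "$store/$hash" ] || cp -- "$file" "$store/$hash" || exit 1
    printf '%s\n' "$url" >> "$store/$hash.url" || exit 1
    printf '%s\n' "$hash"
)

# find_by_url STORE_DIR URL
# Prints the hash of every stored document recorded for exactly this URL.
find_by_url() (
    for f in "$1"/*.url; do
        [ -f "$f" ] || continue
        if grep -qxF -- "$2" "$f"; then   # whole-line, fixed-string match
            name=${f##*/}
            printf '%s\n' "${name%.url}"
        fi
    done
)
```

A call such as `store_doc ./store 'https://example.org/a' page.html` followed by `find_by_url ./store 'https://example.org/a'` would print the same hash twice. The lookup scans every ".url" file, so it is linear in the number of stored documents.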



Things get more interesting when you approach the problem as a database. 
 Save the content wherever and put the metadata into a table -- content 
hash (primary key), URL, download timestamp, author, subject, title, 
keywords, etc..  Create fully inverted indexes.  Create a search engine. 
 Create a spider.  Implementation could range from a CSV/TSV flat-file 
and shell/P* scripts, to a desktop database/UI, to a LAMP stack, and 
beyond (NoSQL, N-tier).  There are distributed file sharing systems 
based on such ideas.
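
At the simplest end of that range, the "CSV/TSV flat-file and shell scripts" variant might look like the sketch below. The three-column layout (hash, URL, UTC timestamp) and the function names are assumptions for illustration. It also assumes URLs contain no literal tab, newline or backslash, since awk's -v assignment interprets backslash escapes.

```sh
#!/bin/sh
# index_add INDEX_FILE HASH URL
# Hypothetical helper: appends one tab-separated row with a UTC timestamp.
index_add() (
    ts=$(date -u +%Y-%m-%dT%H:%M:%SZ)
    printf '%s\t%s\t%s\n' "$2" "$3" "$ts" >> "$1"
)

# index_hashes_for_url INDEX_FILE URL
# Prints the content hash of every download recorded for URL, oldest first.
index_hashes_for_url() {
    awk -F '\t' -v url="$2" '$2 == url { print $1 }' "$1"
}
```

One row per download, rather than one row per hash, keeps the history when the content behind a URL changes. A real database would replace the linear awk scan with an index.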



David



Re: sha256sum --text generating blank spaces and hyphens?

2023-04-26 Thread Albretch Mueller
On 4/26/23, David Christensen  wrote:
> I suggest hashing the document content rather than the URL.  This would
> work nicely for static documents.

 What do you mean by "hashing the document content"?

 How would that help when what you are trying to do is cleanse and
canonize texts as best as you could to find relationships among their
text segments?

 lbrtchx



Re: sha256sum --text generating blank spaces and hyphens?

2023-04-26 Thread Nicolas George
David Christensen (12023-04-26):
> I suggest hashing the document content rather than the URL.  This would work
> nicely for static documents.

That will be very convenient to retrieve the document content from the
URL.

-- 
  Nicolas George



Re: sha256sum --text generating blank spaces and hyphens?

2023-04-26 Thread David Christensen

On 4/26/23 00:41, Albretch Mueller wrote:
>  This is not a debian question per se (more like a Linux bash one),
> but I wasn't able to find an answer on the Internet.
>
>  Here is first the problem I am having before you start reading a
> conspiracy theory into it ;-)
>
>  I need to somehow map URL on the web to a local file, but you can't
> do that for two main reasons:
>
>  1) URLs are free text
>  2) which people take to their heart's content.
>
>  Take for example:
>
> https://dokumen.pub/qdownload/nietzsche-und-der-deutsche-geist-band-4-ausbreitung-und-wirkung-des-nietzscheschen-werkes-im-deutschen-sprachraum-bis-zum-ende-des-zweiten-weltkrieges-ein-schrifttumsverzeichnis-der-jahre-1867-1945-ergnzungen-berichtigungen-und-gesamtverzeichnisse-zu-den-bnden-i-iii-9783110202861-9783110189865-3110189860.html
>
>  that file and the pdf you would download I need to map to a local
> directory looking like: ... /pub/dokumen/qdownload/ ...
>
>  but the file name (excluding the extension) is 306 characters long,
> which Windows NTFS would not swallow. There may be also funky rules
> regarding character sets and where in a string certain chars may be
> used; so, as a way to work around those kinds of problems I:
>
>  a) encode the string name as base64
>  b) calculate the sha256sum of §a
>  c) use §b as file name (of course, leaving the original extension as it is)
>  d) include a "§b_file_name.txt" plain text file descriptor whose only
> content is the actual prehash name of that file.
>
> https://dokumen.pub/qdownload/nietzsche-und-der-deutsche-geist-band-4-ausbreitung-und-wirkung-des-nietzscheschen-werkes-im-deutschen-sprachraum-bis-zum-ende-des-zweiten-weltkrieges-ein-schrifttumsverzeichnis-der-jahre-1867-1945-ergnzungen-berichtigungen-und-gesamtverzeichnisse-zu-den-bnden-i-iii-9783110202861-9783110189865-3110189860.html
>
> _TXT="nietzsche-und-der-deutsche-geist-band-4-ausbreitung-und-wirkung-des-nietzscheschen-werkes-im-deutschen-sprachraum-bis-zum-ende-des-zweiten-weltkrieges-ein-schrifttumsverzeichnis-der-jahre-1867-1945-ergnzungen-berichtigungen-und-gesamtverzeichnisse-zu-den-bnden-i-iii-9783110202861-9783110189865-3110189860"
>  _B64TXTENC=$(printf '%s' "${_TXT}" | base64 )
>  echo "// __ \$_B64TXTENC: |${_B64TXTENC}|"
>  _B64TXTDEC=$(printf '%s' "${_B64TXTENC}" | base64 --decode)
>  echo "// __ \$_B64TXTDEC: |${_B64TXTDEC}|"
>  if [[ "${_TXT}" == "${_B64TXTDEC}" ]]; then
>   echo "// __ [[ \${_TXT} == \${_B64TXTDEC} ]]: |${_TXT}|"
>   _SHA256=$(printf '%s' "${_TXT}" | sha256sum --text )
>   echo "// __ \$_SHA256: |${_SHA256}|"
>  fi
>
> // __ $_SHA256:
> |7d5895cb24ab49692a8ad495e036074fec8e61b22040544f02a9b69c926dbdeb  -|
>
>  I am trying to avoid funky characters and sha256sum --text still
> generates them!?!
>
>  I work like this because I need to replicate the original URL as a local
> path in a way that would be compatible with any file system.
>
>  Do you know of a better way to deal with such issues?
>
>  lbrtchx



I will assume you have solved the sha256sum output issue.  (I would use 
Perl and Digest::SHA.)



I suggest hashing the document content rather than the URL.  This would 
work nicely for static documents.



David



Re: sha256sum --text generating blank spaces and hyphens?

2023-04-26 Thread Andy Smith
Hello,

On Wed, Apr 26, 2023 at 08:30:01PM +, Albretch Mueller wrote:
> On 4/26/23, Dan Ritter  wrote:
> > The only characters used in the sha256 hash itself are [a-f] and
> > [0-9]
> 
>  Yes, I knew that; that is why I could not understand why sha256sum
> was being "courteous" to me.

The man page is very clear on what the output will be, and just
running it on a few files should also make it obvious to you. Also,
all the other sha*sum and md5sum utilities work pretty much the same
way. So I don't know why this comes as a surprise, but OK.

>  OK, now I see why cutting off the string on the first space that
> appears is safe. I never saw such cases because I always used sha*sums
> on files.

But it does the same thing with files…

Thanks,
Andy

-- 
https://bitfolk.com/ -- No-nonsense VPS hosting



Re: sha256sum --text generating blank spaces and hyphens?

2023-04-26 Thread Albretch Mueller
On 4/26/23, David Wright  wrote:
> I guess you need the expense of sha256 rather than md5 as you're
> downloading the entire web?

 I am not downloading the entire web. I have no way of knowing how
they entertained those ideations but I think we could use their
estimate when google said that approximately 1 million and a half
books have been ever published. Think of it! It is not that much data.
It would all fit nicely in one hard drive; include some searching
capability and "bye bye google" will be the name of your movie. At
times you need to gain a sense of things before going into exposed
mode to search for something (which these days means making sure you
are not being baited into something else)

On 4/26/23, Dan Ritter  wrote:
> The only characters used in the sha256 hash itself are [a-f] and
> [0-9]

 Yes, I knew that; that is why I could not understand why sha256sum
was being "courteous" to me.

On 4/26/23, Nicolas George  wrote:
> shaXsum always writes X/4 hexadecimal nibbles then two spaces then the
> file name. If the input is from stdin, then the convention is the file
> name is ‘-’.
>
> (Well, not always always: if the file name contains very special
> characters, it will use an escaped output format. And there is the -z
> option.)

On 4/26/23, Thomas Schmitt  wrote:
> "FILE" is the minus-sign for standard input. The second blank is there
> to indicate the text mode of sha256sum.
> Only the first blank is somewhat puzzling. But it's always there.
>
>
> https://www.gnu.org/software/coreutils/manual/html_node/sha2-utilities#sha2-utilities
> points to
>
> https://www.gnu.org/software/coreutils/manual/html_node/md5sum-invocation.html
> which says
>   For each file, ‘md5sum’ outputs by default, the MD5 checksum, a space,
>   a flag indicating binary or text input mode, and the file name. Binary
>   mode is indicated with ‘*’, text mode with ‘ ’ (space). Binary mode is
>   the default on systems where it’s significant, otherwise text mode is
>   the default. The cksum command always uses binary mode and a ‘ ’
>   (space) flag.
>
> So the first blank can be relied on and thus the proposal by Andy Smith
> to use "awk '{print $1}'" is valid.

 OK, now I see why cutting off the string on the first space that
appears is safe. I never saw such cases because I always used sha*sums
on files. I would expect if a user enters a string via printf that was
all there was to it. Of course, sha*sums can tell apart a file from a
text string.

On 4/26/23, Jeffrey Walton  wrote:
> There's no guarantee a URL will map onto a filesystem.

> I seem to
> recall Stunnel tried to do that in a caching mode, but it had weird
> corner cases. (In addition to problems with filesystems that had
> character set and path limitations).

 Well, no; and I am fine with:
 a) trying to best match both; the URL path as best as possible
 b) the extra juggling act of base64-ing and hashing the name of the file ...

 Something I have learned as a corpora research kind of guy is not to
ever try to "educate" people. I would just take their sh!t as they
dump it and cleanse, deal with it!

 You would not hear the end of it if I start telling stories of the
kind of cr@p you find out there when you look at the web from that
point of view.

> I think your best bet is to digest the URL into a representation. I
> suggest using SipHash+Base64 or Base64URL. SipHash provides collision
> resistance, a uniform distribution, and its fast. SipHash has a very
> good pedigree since it was designed by Jean-Philippe Aumasson and
> Daniel J. Bernstein. The final Base64 or Base64URL encoding ensures
> you stay within printable character range without reserved file system
> characters.

 Thank you I will look into what they did when I get a chance,

 lbrtchx



Re: sha256sum --text generating blank spaces and hyphens?

2023-04-26 Thread Albretch Mueller
On 4/26/23, David Wright  wrote:
> I guess you need the expense of sha256 rather than md5 as you're
> downloading the entire web?

 I am not downloading the entire web. I have no way of knowing how
they entertained those ideations but I think we could use their
estimate when they said that approximately 1 million and a half books
have been ever published. Think of it! It is not that much data. It
would all fit nicely in one hard drive include some searching
capability and "bye bye google" will be the name of your movie.

On 4/26/23, Dan Ritter  wrote:
> The only characters used in the sha256 hash itself are [a-f] and
> [0-9]

 Yes, I knew that; that is why I could not understand why sha256sum
was being "courteous" to me.

On 4/26/23, Nicolas George  wrote:
> shaXsum always writes X/4 hexadecimal nibbles then two spaces then the
> file name. If the input is from stdin, then the convention is the file
> name is ‘-’.
>
> (Well, not always always: if the file name contains very special
> characters, it will use an escaped output format. And there is the -z
> option.)

On 4/26/23, Thomas Schmitt  wrote:
> "FILE" is the minus-sign for standard input. The second blank is there
> to indicate the text mode of sha256sum.
> Only the first blank is somewhat puzzling. But it's always there.
>
>
> https://www.gnu.org/software/coreutils/manual/html_node/sha2-utilities#sha2-utilities
> points to
>
> https://www.gnu.org/software/coreutils/manual/html_node/md5sum-invocation.html
> which says
>   For each file, ‘md5sum’ outputs by default, the MD5 checksum, a space,
>   a flag indicating binary or text input mode, and the file name. Binary
>   mode is indicated with ‘*’, text mode with ‘ ’ (space). Binary mode is
>   the default on systems where it’s significant, otherwise text mode is
>   the default. The cksum command always uses binary mode and a ‘ ’
>   (space) flag.
>
> So the first blank can be relied on and thus the proposal by Andy Smith
> to use "awk '{print $1}'" is valid.

 OK, now I see why cutting off the string on the first space that
appears is safe. I never saw such cases because I always used sha*sums
on files. I would expect if a user enters a string via printf that was
all there was to it. Of course, sha*sums can tell apart a file from a
plain text string.

On 4/26/23, Jeffrey Walton  wrote:
> There's no guarantee a URL will map onto a filesystem.

> I seem to
> recall Stunnel tried to do that in a caching mode, but it had weird
> corner cases. (In addition to problems with filesystems that had
> character set and path limitations).

 Well, no; and I am fine with:
 a) trying to best match both; the URL path as best as possible
 b) the extra juggling act of base64-ing and hashing the name of the file ...

 Something I have learned as a corpora research kind of guy is not to
ever try to "educate" people. I would just take their sh!t as they
dump it and cleanse, deal with it!

 You would not hear the end of it if I start telling stories of the
kind of cr@p you find out there when you look at the web from that
point of view: from folks at archive.org who would list: "Henry
Valentine Miller", "Henry V. Miller", "Henry Miller", "henry miller",
"Miller, Henry", "Miller, Henry 12-1891 06-1980" apparently as
different authors/"creators", to the gutenberg.org large text bank
including some protagonistic bs in the actual texts, to developers of
libreoffice watermarking text with some cr@p which of course is being
used for "monitoring" purposes by the kinds of folks who put
"intelligence" in the names of the organizations they work for and to
make sure they are making sense they put flags around them when they
fart through their mouths whatever nonsense they think of.

 I had had rehearsing day dreams about becoming a dictator of the
world ;-) and making people do "the right thing" (tm) ... until I had
once an epiphany while watching Trump talk to a media prestitude who
characteristically wasn't making much sense. After asking a few
questions trying to make sense of what she was saying, prestitude said
"let me formulate it better". Trump quietly sat back saying: "OK, take
your time"!!!

 I was amazed! There you have someone the U.S. media, who as a mouth
piece of the status quo, were being viscerally offensive towards
anything relating to him, including posting on the front page of
mainstream US news papers naked pictures of his wife and mother of his
child one month before she became "the first lady" and he took it
easy, respectfully on her! That was the best case I have noticed so
far of "separating the message from the messenger". I mean people who
erect all those pay walls and somehow see themselves as authoring,
guarding content are not even the messengers and we all have to put up
with their bs.

> I think your best bet is to digest the URL into a representation. I
> suggest using SipHash+Base64 or Base64URL. SipHash provides collision
> resistance, a uniform distribution, and its fast. SipHash has a very
> good pe

Re: sha256sum --text generating blank spaces and hyphens?

2023-04-26 Thread tomas
On Wed, Apr 26, 2023 at 02:33:03PM +, Albretch Mueller wrote:

[...]

>  because I would like to include the three strings in the file descriptor:
>  a) the crazy long name
>  b) its base64 representation
>  c) §b's sha256sum representation which is the one used for the file
> name and the log of the download.

[...]

It's your work, of course.

> >> // __ $_SHA256:
> >> |7d5895cb24ab49692a8ad495e036074fec8e61b22040544f02a9b69c926dbdeb  -|
> >
> >
> > I only see harmless hexadecimal chars there.
> >
> >>  I am trying to avoid funky characters and sha256sum --text still
> >> generates them!?!
> >
> > Where are there "funky chars"?
> 
>  This is the first time I have seen blank spaces and hyphens in a text
> segment's sum. Those characters might be confusing.

Ah -- I think someone else (I think it was Dan, sorry if my memory
fails me) pointed that out already. The dash is the "file name"
(which in this case was stdin, this follows a widespread convention).

All those sums output the sum (never ever spaces in there), a
whitespace, then the file name. Background: you can give them
multiple args, then they generate a list of sums and names, which
you then can conveniently use with the -c option to see whether
any of the files has changed.
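
A small sketch of that workflow, assuming GNU coreutils. The two wrapper names are invented for illustration; `--check` and `--quiet` are the long forms of options documented for sha256sum.

```sh
#!/bin/sh
# record_sums MANIFEST FILE...
# Hypothetical wrapper: writes one "digest  name" line per file to MANIFEST.
record_sums() (
    manifest=$1
    shift
    sha256sum -- "$@" > "$manifest"
)

# verify_sums MANIFEST
# Exit status 0 if every listed file still matches; failures are reported,
# and --quiet suppresses the "OK" line for each unchanged file.
verify_sums() {
    sha256sum --check --quiet -- "$1"
}
```

File names are stored in the manifest exactly as given, so relative names have to be verified from the same working directory they were recorded in.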

> > Besides, I don't think --text does what you think it does. Quoting
> > the manpage:
> >
> >   "Note: There is no difference between binary mode and text
> >   mode on GNU systems."
> 
>  Thank you. I was playing with different options to see if that was
> the reason I was getting those white spaces and hyphens at the end.
> 
>  Why is that happening? How could it be avoided? Could you set the
> characters used in the representation of a sum?

You just cut it out with, e.g. 'cut' like so:

  sha256sum | cut -d' ' -f1

Cheers
-- 
t



Re: sha256sum --text generating blank spaces and hyphens?

2023-04-26 Thread Jeffrey Walton
On Wed, Apr 26, 2023 at 3:42 AM Albretch Mueller  wrote:
>
>  This is not a debian question per se (more like a Linux bash one),
> but I wasn't able to find an answer on the Internet.
>
>  Here is first the problem I am having before you start reading a
> conspiracy theory into it ;-)
>
>  I need to somehow map URL on the web to a local file, but you can't
> do that for two main reasons:
>
>  1) URLs are free text
>  2) which people take to their heart's content.
>
>  Take for example:
>
>  
> https://dokumen.pub/qdownload/nietzsche-und-der-deutsche-geist-band-4-ausbreitung-und-wirkung-des-nietzscheschen-werkes-im-deutschen-sprachraum-bis-zum-ende-des-zweiten-weltkrieges-ein-schrifttumsverzeichnis-der-jahre-1867-1945-ergnzungen-berichtigungen-und-gesamtverzeichnisse-zu-den-bnden-i-iii-9783110202861-9783110189865-3110189860.html
>
>  that file and the pdf you would download I need to map to a local
> directory looking like: ... /pub/dokumen/qdownload/ ...
>
>  but the file name (excluding the extension) is 306 characters long,
> which Windows NTFS would not swallow. There may be also funky rules
> regarding character sets and where in a string certain chars may be
> used; so, as a way to work around those kinds of problems I:
>
>  a) encode the string name as base64
>  b) calculate the sha256sum of §a
>  c) use §b as file name (of course, leaving the original extension as it is)
>  d) include a "§b_file_name.txt" plain text file descriptor which only
> content is the actual prehash name of that file.
>
>
>  
> https://dokumen.pub/qdownload/nietzsche-und-der-deutsche-geist-band-4-ausbreitung-und-wirkung-des-nietzscheschen-werkes-im-deutschen-sprachraum-bis-zum-ende-des-zweiten-weltkrieges-ein-schrifttumsverzeichnis-der-jahre-1867-1945-ergnzungen-berichtigungen-und-gesamtverzeichnisse-zu-den-bnden-i-iii-9783110202861-9783110189865-3110189860.html
>  
> _TXT="nietzsche-und-der-deutsche-geist-band-4-ausbreitung-und-wirkung-des-nietzscheschen-werkes-im-deutschen-sprachraum-bis-zum-ende-des-zweiten-weltkrieges-ein-schrifttumsverzeichnis-der-jahre-1867-1945-ergnzungen-berichtigungen-und-gesamtverzeichnisse-zu-den-bnden-i-iii-9783110202861-9783110189865-3110189860"
>  _B64TXTENC=$(printf '%s' "${_TXT}" | base64 )
>  echo "// __ \$_B64TXTENC: |${_B64TXTENC}|"
>  _B64TXTDEC=$(printf '%s' "${_B64TXTENC}" | base64 --decode)
>  echo "// __ \$_B64TXTDEC: |${_B64TXTDEC}|"
>  if [[ "${_TXT}" == "${_B64TXTDEC}" ]]; then
>   echo "// __ [[ \${_TXT} == \${_B64TXTDEC} ]]: |${_TXT}|"
>   _SHA256=$(printf '%s' "${_TXT}" | sha256sum --text )
>   echo "// __ \$_SHA256: |${_SHA256}|"
>  fi
>
> // __ $_SHA256:
> |7d5895cb24ab49692a8ad495e036074fec8e61b22040544f02a9b69c926dbdeb  -|
>
>  I am trying to avoid funky characters and sha256sum --text still
> generates them!?!
>
>  I work like this because I need replicate the original URL as a local
> path in a way that would be compatible any file system.
>
>  Do you know of a better way to deal with such issues?

There's no guarantee a URL will map onto a filesystem. I seem to
recall Stunnel tried to do that in a caching mode, but it had weird
corner cases. (In addition to problems with filesystems that had
character set and path limitations).

I think your best bet is to digest the URL into a representation. I
suggest using SipHash+Base64 or Base64URL. SipHash provides collision
resistance, a uniform distribution, and it's fast. SipHash has a very
good pedigree since it was designed by Jean-Philippe Aumasson and
Daniel J. Bernstein. The final Base64 or Base64URL encoding ensures
you stay within printable character range without reserved file system
characters.
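A rough sketch of the digest-then-Base64URL idea. Coreutils ships no SipHash
tool, so SHA-256 is swapped in here as the digest; that substitution is an
assumption of this sketch, not what was suggested above. The helper names
hex_to_bytes and url_key are hypothetical:

```shell
# Turn a hex string into raw bytes using POSIX printf octal escapes.
# Assumes an even number of hex digits.
hex_to_bytes() {
  hex=$1
  [ $(( ${#hex} % 2 )) -eq 0 ] || return 1
  while [ -n "$hex" ]; do
    byte=${hex%"${hex#??}"}      # first two hex digits
    hex=${hex#??}                # the rest
    printf "\\$(printf '%03o' "0x$byte")"
  done
}

# Digest a string, then encode the 32 raw bytes as unpadded Base64URL:
# '+' becomes '-', '/' becomes '_', and the trailing '=' is dropped.
url_key() {
  hex_to_bytes "$(printf '%s' "$1" | sha256sum | cut -d' ' -f1)" |
    base64 | tr '+/' '-_' | tr -d '=\n'
}

url_key 'https://example.org/some/very/long/path.html'
echo
```

The result is 43 characters from [A-Za-z0-9_-], against 64 for the plain
hex digest. Note that it is case-sensitive, which matters on file systems
that fold case.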

Jeff



Re: sha256sum --text generating blank spaces and hyphens?

2023-04-26 Thread Dan Ritter
Albretch Mueller wrote: 
> On 4/26/23, Andy Smith  wrote:
> > If you're referring to the space and then the file name ("-" in case
> > of stdin) on the end, you can just select only the first output up
> > to whitespace with e.g. awk:
> >
> > _SHA256=$(printf '%s' "${_TXT}" | sha256sum | awk '{print $1}')
> 
>  Yes, you could but I am trying to find out why this is happening
> instead of truncating the string when a space appears because I don't
> think what would be safe.

The white space and the - are not part of the sha256, they are
emitted by sha256sum as a courtesy. You can safely remove
everything starting with the first whitespace.

> >> // __ $_SHA256:
> >> |7d5895cb24ab49692a8ad495e036074fec8e61b22040544f02a9b69c926dbdeb  -|
> >
> > I only see harmless hexadecimal chars there.
> >
> >>  I am trying to avoid funky characters and sha256sum --text still
> >> generates them!?!
> >
> > Where are there "funky chars"?
> 
>  This is the first time I have seen blank spaces and hyphens in a text
> segment's sum. Those characters might be confusing.

The white space and the - are not part of the sha256, they are
emitted by sha256sum as a courtesy. You can safely remove
everything starting with the first whitespace.

>  Why is that happening? How could it be avoided? Could you set the
> characters used in the representation of a sum?

The white space and the - are not part of the sha256, they are
emitted by sha256sum as a courtesy. You can safely remove
everything starting with the first whitespace.

The only characters used in the sha256 hash itself are [a-f] and
[0-9]
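A small sketch of how that could be checked in a script: strip everything
from the first space on, then confirm that what is left is exactly 64
characters from [0-9a-f]. The variable names and the sample input 'abc' are
only for illustration:

```shell
raw=$(printf '%s' 'abc' | sha256sum)

# Drop everything starting with the first space.
digest=${raw%% *}

# Accept only 64 lowercase hexadecimal characters.
case $digest in
  *[!0-9a-f]*) verdict=unexpected ;;
  *) if [ "${#digest}" -eq 64 ]; then verdict=clean; else verdict=unexpected; fi ;;
esac

printf '%s %s\n' "$verdict" "$digest"
```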

-dsr-



Re: sha256sum --text generating blank spaces and hyphens?

2023-04-26 Thread Thomas Schmitt
Hi,

Albretch Mueller wrote:
> > >  I am trying to avoid funky characters and sha256sum --text still
> > > generates them!?!

Andy Smith wrote:
> > If you're referring to the space and then the file name ("-" in case
> > of stdin) on the end, you can just select only the first output up
> > to whitespace with e.g. awk:

> Yes, you could but I am trying to find out why this is happening
> instead of truncating the string when a space appears because I don't
> think what would be safe.

One of the blanks and the hyphen-or-minus are announced by the man page:

  The default mode is
  to print a line with checksum, a character indicating input  mode  ('*'
  for binary, space for text), and name for each FILE.

"FILE" is the minus-sign for standard input. The second blank is there
to indicate the text mode of sha256sum.
Only the first blank is somewhat puzzling. But it's always there.

  
https://www.gnu.org/software/coreutils/manual/html_node/sha2-utilities#sha2-utilities
points to
  https://www.gnu.org/software/coreutils/manual/html_node/md5sum-invocation.html
which says
  For each file, ‘md5sum’ outputs by default, the MD5 checksum, a space,
  a flag indicating binary or text input mode, and the file name. Binary
  mode is indicated with ‘*’, text mode with ‘ ’ (space). Binary mode is
  the default on systems where it’s significant, otherwise text mode is
  the default. The cksum command always uses binary mode and a ‘ ’
  (space) flag.

So the first blank can be relied on and thus the proposal by Andy Smith
to use "awk '{print $1}'" is valid.


Have a nice day :)

Thomas



Re: sha256sum --text generating blank spaces and hyphens?

2023-04-26 Thread David Wright
On Wed 26 Apr 2023 at 14:33:03 (+), Albretch Mueller wrote:
> On 4/26/23, to...@tuxteam.de  wrote:
> >>  a) encode the string name as base64
> >>  b) calculate the sha256sum of §a
> >
> > Why the detour over base64?
> 
>  because I would like to include the three strings in the file descriptor:
>  a) the crazy long name
>  b) its base64 representation

The base64 command wraps the output, in case you didn't notice.
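A short sketch of the wrapping, assuming GNU base64 (default line width 76,
and -w 0 to switch wrapping off). The 100-character test string is made up:

```shell
# 100 bytes of input encode to 136 base64 characters.
long=$(printf '%0100d' 0)

wrapped=$(printf '%s' "$long" | base64)         # broken into lines of 76
unwrapped=$(printf '%s' "$long" | base64 -w 0)  # one single line

# A portable alternative to -w 0 is to delete the newlines afterwards.
stripped=$(printf '%s' "$long" | base64 | tr -d '\n')

wrapped_lines=$(printf '%s\n' "$wrapped" | wc -l)
unwrapped_lines=$(printf '%s\n' "$unwrapped" | wc -l)
printf 'wrapped: %s line(s), unwrapped: %s line(s)\n' "$wrapped_lines" "$unwrapped_lines"
```

Decoding ignores the line breaks, but a checksum taken over the wrapped
text differs from one taken over the unwrapped text, because the newlines
are part of the hashed input.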

>  c) §b's sha256sum representation which is the one used for the file
> name and the log of the download.

I guess you need the expense of sha256 rather than md5 as you're
downloading the entire web?

>  I would like to make this scheme "fool (and fail) proof" as they say.
> There is no way in earth that a file system messes with all three
> aspects of it.
> 
> >>  c) use §b as file name (of course, leaving the original extension as it
> >> is)
> >
> > Why the extension? DOS nostalgia?
> 
>  The local copies should represent the web URLs as close as possible
> in order to minimize "what came from where" kinds of confusions. Also
> from the same URL you would then download the corresponding pdf file
> with exactly the same name, the only difference being the extension.

The extension is part of the name. If you preserve it as is, what
happens when it contains a "funky" character?

> >> // __ $_SHA256:
> >> |7d5895cb24ab49692a8ad495e036074fec8e61b22040544f02a9b69c926dbdeb  -|
> >
> >
> > I only see harmless hexadecimal chars there.
> >
> >>  I am trying to avoid funky characters and sha256sum --text still
> >> generates them!?!
> >
> > Where are there "funky chars"?
> 
>  This is the first time I have seen blank spaces and hyphens in a text
> segment's sum. Those characters might be confusing.

You calculated the sha256sum of stdin. - is the name of the input file
you encoded. Duh.

Cheers,
David.



Re: sha256sum --text generating blank spaces and hyphens?

2023-04-26 Thread Nicolas George
Albretch Mueller (12023-04-26):
>  Yes, you could but I am trying to find out why this is happening
> instead of truncating the string when a space appears because I don't
> think what would be safe.

shaXsum always writes X/4 hexadecimal nibbles then two spaces then the
file name. If the input is from stdin, then the convention is the file
name is ‘-’.

(Well, not always always: if the file name contains very special
characters, it will use an escaped output format. And there is the -z
option.)

For your case, just use “cut -c 1-64”.
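As a quick illustration of the X/4 rule, a sketch with two of the tools.
The input 'abc' is arbitrary:

```shell
# SHA-256 gives 256/4 = 64 hex characters, SHA-512 gives 512/4 = 128.
digest256=$(printf '%s' 'abc' | sha256sum | cut -c 1-64)
digest512=$(printf '%s' 'abc' | sha512sum | cut -c 1-128)

printf '%s\n%s\n' "$digest256" "$digest512"
```

Cutting by column is fine when the input comes from stdin. With the escaped
output format mentioned above the line no longer starts with the checksum,
so a fixed column range would not be safe there.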

> > Why the detour over base64?
>  because I would like to include the three strings in the file descriptor:
>  a) the crazy long name
>  b) its base64 representation
>  c) §b's sha256sum representation which is the one used for the file
> name and the log of the download.

Then do so, but in c, store the SHA-256 of the URL, not the SHA-256 of
the base64 encoding of the URL.

>  The local copies should represent the web URLs as close as possible
> in order to minimize "what came from where" kinds of confusions.

You are right to do so. Many utilities rely on the extension to decide
what to do with a file. Lacking a standardized place to store the file
type, it is the most robust option. Applications that rely on probing
and heuristics, like libfile and co., are in fact much less reliable and
a lot more annoying.

(Also, if we were to want a standardized place to store the file type, a
lot of user interface would have to be revamped.)

OTOH, HTTP does have a place to state the type of the file, and the
extension in URLs is not reliable: if you want to do it properly, you
must set your local file extension based on the Content-Type response
header.
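One way this could look, as a sketch only: a lookup from the Content-Type
value to a local extension. The function name ext_for_content_type, the
handful of types listed, and the "bin" fallback are all assumptions of this
example, not a complete or standard table:

```shell
# Map a Content-Type header value to a file extension.
ext_for_content_type() {
  # Drop parameters such as "; charset=UTF-8", lowercase, strip blanks.
  mime=$(printf '%s' "$1" | cut -d';' -f1 | tr 'A-Z' 'a-z' | tr -d ' \t\r')
  case $mime in
    text/html)       echo html ;;
    text/plain)      echo txt ;;
    application/pdf) echo pdf ;;
    image/jpeg)      echo jpg ;;
    image/png)       echo png ;;
    *)               echo bin ;;   # unknown type: arbitrary fallback
  esac
}

# The header itself would come from the HTTP response, for example with
# something like:  curl -sIL "$url"   (not run here).
ext_for_content_type 'text/html; charset=UTF-8'
ext_for_content_type 'application/pdf'
```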

>  Also
> from the same URL you would then download the corresponding pdf file
> with exactly the same name, the only difference being the extension.

Then you need to exclude the extension from the URL, but a lot of URLs
do not have extensions and you should be using the Content-Type instead.
This feature is a pipe dream.

>  This is the first time I have seen blank spaces and hyphens in a text
> segment's sum. Those characters might be confusing.

See above.

Regards,

-- 
  Nicolas George



Re: sha256sum --text generating blank spaces and hyphens?

2023-04-26 Thread Albretch Mueller
On 4/26/23, Andy Smith  wrote:
> If you're referring to the space and then the file name ("-" in case
> of stdin) on the end, you can just select only the first output up
> to whitespace with e.g. awk:
>
> _SHA256=$(printf '%s' "${_TXT}" | sha256sum | awk '{print $1}')

 Yes, you could, but I am trying to find out why this is happening
instead of truncating the string when a space appears, because I don't
think that would be safe.

> These web sites can change their URLs at any time you know, so it
> may not be worth trying to replicate their structure locally.

 yes, I know and my way to deal with such issues is:

 a) by including in the name of the web log of the download the date
and time ...
 b) once the data file is downloaded, say a pdf file of an old book or
some publication, all the metadata in the front and back pages of the
book are OCRed, the actual title, ISBN, publishing date ...

On 4/26/23, to...@tuxteam.de  wrote:
>>  a) encode the string name as base64
>>  b) calculate the sha256sum of §a
>
> Why the detour over base64?

 because I would like to include the three strings in the file descriptor:
 a) the crazy long name
 b) its base64 representation
 c) §b's sha256sum representation which is the one used for the file
name and the log of the download.

 I would like to make this scheme "fool (and fail) proof" as they say.
There is no way on earth that a file system messes with all three
aspects of it.
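 A minimal sketch of such a three-string descriptor. The "name:", "base64:"
and "sha256:" labels, the short stand-in name, and the scratch directory are
assumptions made for this example; the sum is taken over the unwrapped
base64 text, as step b) describes:

```shell
name='some-crazy-long-name-from-a-url'   # stand-in for the real name

# a) base64 of the name, with line wrapping removed
b64=$(printf '%s' "$name" | base64 | tr -d '\n')

# b) sha256 of the base64 text, checksum field only
sum=$(printf '%s' "$b64" | sha256sum | cut -d' ' -f1)

# d) descriptor file named after the sum, holding all three strings
outdir=$(mktemp -d)
descriptor="$outdir/${sum}_file_name.txt"
printf 'name: %s\nbase64: %s\nsha256: %s\n' "$name" "$b64" "$sum" > "$descriptor"

cat "$descriptor"
```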

>>  c) use §b as file name (of course, leaving the original extension as it
>> is)
>
> Why the extension? DOS nostalgia?

 The local copies should represent the web URLs as close as possible
in order to minimize "what came from where" kinds of confusions. Also
from the same URL you would then download the corresponding pdf file
with exactly the same name, the only difference being the extension.

>> // __ $_SHA256:
>> |7d5895cb24ab49692a8ad495e036074fec8e61b22040544f02a9b69c926dbdeb  -|
>
>
> I only see harmless hexadecimal chars there.
>
>>  I am trying to avoid funky characters and sha256sum --text still
>> generates them!?!
>
> Where are there "funky chars"?

 This is the first time I have seen blank spaces and hyphens in a text
segment's sum. Those characters might be confusing.

> Besides, I don't think --text does what you think it does. Quoting
> the manpage:
>
>   "Note: There is no difference between binary mode and text
>    mode on GNU systems."

 Thank you. I was playing with different options to see if that was
the reason I was getting those white spaces and hyphens at the end.

 Why is that happening? How could it be avoided? Could you set the
characters used in the representation of a sum?

 lbrtchx



Re: sha256sum --text generating blank spaces and hyphens?

2023-04-26 Thread Andy Smith
Hello,

On Wed, Apr 26, 2023 at 07:41:56AM +, Albretch Mueller wrote:
>   _SHA256=$(printf '%s' "${_TXT}" | sha256sum --text )
>   echo "// __ \$_SHA256: |${_SHA256}|"

[…]

> // __ $_SHA256:
> |7d5895cb24ab49692a8ad495e036074fec8e61b22040544f02a9b69c926dbdeb  -|
> 
>  I am trying to avoid funky characters and sha256sum --text still
> generates them!?!

If you're referring to the space and then the file name ("-" in case
of stdin) on the end, you can just select only the first output up
to whitespace with e.g. awk:

_SHA256=$(printf '%s' "${_TXT}" | sha256sum | awk '{print $1}')

Your use of "--text" does nothing by the way.

>  I work like this because I need replicate the original URL as a local
> path in a way that would be compatible any file system.

These web sites can change their URLs at any time you know, so it
may not be worth trying to replicate their structure locally.

Also maybe you want some sort of web site mirroring solution.

Cheers,
Andy

-- 
https://bitfolk.com/ -- No-nonsense VPS hosting



Re: sha256sum --text generating blank spaces and hyphens?

2023-04-26 Thread tomas
On Wed, Apr 26, 2023 at 07:41:56AM +, Albretch Mueller wrote:
>  This is not a debian question per se (more like a Linux bash one),
> but I wasn't able to find an answer on the Internet.
> 
>  Here is first the problem I am having before you start reading a
> conspiracy theory into it ;-)
> 
>  I need to somehow map URL on the web to a local file, but you can't
> do that for two main reasons:

OK.

[...]

>  but the file name (excluding the extension) is 306 characters long,
> which Windows NTFS [...]

There's the first problem.

>  a) encode the string name as base64
>  b) calculate the sha256sum of §a

Why the detour over base64?

>  c) use §b as file name (of course, leaving the original extension as it is)

Why the extension? DOS nostalgia?

>  d) include a "§b_file_name.txt" plain text file descriptor which only
> content is the actual prehash name of that file.

OK.

[...]

> // __ $_SHA256:
> |7d5895cb24ab49692a8ad495e036074fec8e61b22040544f02a9b69c926dbdeb  -|


I only see harmless hexadecimal chars there.

>  I am trying to avoid funky characters and sha256sum --text still
> generates them!?!

Where are there "funky chars"?

>  I work like this because I need replicate the original URL as a local
> path in a way that would be compatible any file system.
> 
>  Do you know of a better way to deal with such issues?

Besides, I don't think --text does what you think it does. Quoting
the manpage:

  "Note: There is no difference between binary mode and text
   mode on GNU systems."

This is about *reading* the input in text or binary mode, which are
equivalent in most civilised operating systems.

Cheers
-- 
t



sha256sum --text generating blank spaces and hyphens?

2023-04-26 Thread Albretch Mueller
 This is not a debian question per se (more like a Linux bash one),
but I wasn't able to find an answer on the Internet.

 Here is first the problem I am having before you start reading a
conspiracy theory into it ;-)

 I need to somehow map URL on the web to a local file, but you can't
do that for two main reasons:

 1) URLs are free text
 2) which people take to their heart's content.

 Take for example:

 
https://dokumen.pub/qdownload/nietzsche-und-der-deutsche-geist-band-4-ausbreitung-und-wirkung-des-nietzscheschen-werkes-im-deutschen-sprachraum-bis-zum-ende-des-zweiten-weltkrieges-ein-schrifttumsverzeichnis-der-jahre-1867-1945-ergnzungen-berichtigungen-und-gesamtverzeichnisse-zu-den-bnden-i-iii-9783110202861-9783110189865-3110189860.html

 that file and the pdf you would download I need to map to a local
directory looking like: ... /pub/dokumen/qdownload/ ...

 but the file name (excluding the extension) is 306 characters long,
which Windows NTFS would not swallow. There may be also funky rules
regarding character sets and where in a string certain chars may be
used; so, as a way to work around those kinds of problems I:

 a) encode the string name as base64
 b) calculate the sha256sum of §a
 c) use §b as file name (of course, leaving the original extension as it is)
 d) include a "§b_file_name.txt" plain text file descriptor whose only
content is the actual prehash name of that file.


 
https://dokumen.pub/qdownload/nietzsche-und-der-deutsche-geist-band-4-ausbreitung-und-wirkung-des-nietzscheschen-werkes-im-deutschen-sprachraum-bis-zum-ende-des-zweiten-weltkrieges-ein-schrifttumsverzeichnis-der-jahre-1867-1945-ergnzungen-berichtigungen-und-gesamtverzeichnisse-zu-den-bnden-i-iii-9783110202861-9783110189865-3110189860.html
 
_TXT="nietzsche-und-der-deutsche-geist-band-4-ausbreitung-und-wirkung-des-nietzscheschen-werkes-im-deutschen-sprachraum-bis-zum-ende-des-zweiten-weltkrieges-ein-schrifttumsverzeichnis-der-jahre-1867-1945-ergnzungen-berichtigungen-und-gesamtverzeichnisse-zu-den-bnden-i-iii-9783110202861-9783110189865-3110189860"
 _B64TXTENC=$(printf '%s' "${_TXT}" | base64 )
 echo "// __ \$_B64TXTENC: |${_B64TXTENC}|"
 _B64TXTDEC=$(printf '%s' "${_B64TXTENC}" | base64 --decode)
 echo "// __ \$_B64TXTDEC: |${_B64TXTDEC}|"
 if [[ "${_TXT}" == "${_B64TXTDEC}" ]]; then
  echo "// __ [[ \${_TXT} == \${_B64TXTDEC} ]]: |${_TXT}|"
  _SHA256=$(printf '%s' "${_TXT}" | sha256sum --text )
  echo "// __ \$_SHA256: |${_SHA256}|"
 fi

// __ $_SHA256:
|7d5895cb24ab49692a8ad495e036074fec8e61b22040544f02a9b69c926dbdeb  -|

 I am trying to avoid funky characters and sha256sum --text still
generates them!?!

 I work like this because I need to replicate the original URL as a local
path in a way that would be compatible with any file system.

 Do you know of a better way to deal with such issues?

 lbrtchx