Re: [CentOS] Gulliver

2017-10-31 Thread Gordon Messmer

On 10/31/2017 12:06 PM, Warren Young wrote:

This problem is *solved*.



Well, yes.  But if endian data is the problem, then it's pretty clear 
that none of those solutions are in use, and I'm suggesting the absolute 
minimum-effort solution to the problem.




Re: [CentOS] Gulliver

2017-10-31 Thread Warren Young
On Oct 31, 2017, at 12:47 PM, Gordon Messmer wrote:
> 
> If this is an application that you've developed in-house, you should be using 
> htonl() to convert your 32-bit values to network byte order

…or its superset, XDR [1]
…or use a text format (XML, JSON, YAML, SQL, CSV…)
…or use a binary serialization of same (BSON, CBOR, Binary XML…)
…or use FlatBuffers [2]
…or use ASN.1 [3]

or, or, or.  This problem is *solved*.  The only difficult part is choosing 
which of the many available solutions to use.



[1]: https://en.wikipedia.org/wiki/External_Data_Representation
[2]: https://en.wikipedia.org/wiki/FlatBuffers
[3]: https://en.wikipedia.org/wiki/Abstract_Syntax_Notation_One
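
For the raw ADC captures in the original post, the simplest of those
options (a plain text format) could look roughly like the C sketch
below. It is purely illustrative: the function names, the sample
values, and the "samples.txt" filename are made up, and the
one-value-per-line layout is just an assumption.

#include <stdio.h>
#include <stdint.h>

/* Store each 32-bit sample as decimal text, one value per line,
 * so byte order never enters the picture. */
static void save_samples_text(FILE *f, const uint32_t *samples, size_t n)
{
    for (size_t i = 0; i < n; i++)
        fprintf(f, "%u\n", (unsigned)samples[i]);
}

/* Read the values back on any architecture, big- or little-endian. */
static size_t load_samples_text(FILE *f, uint32_t *samples, size_t max)
{
    unsigned value;
    size_t n = 0;

    while (n < max && fscanf(f, "%u", &value) == 1)
        samples[n++] = (uint32_t)value;
    return n;
}

int main(void)
{
    uint32_t samples[] = { 1, 255, 70000 };
    uint32_t back[3];
    FILE *f = fopen("samples.txt", "w+");   /* illustrative filename */

    if (!f)
        return 1;
    save_samples_text(f, samples, 3);
    rewind(f);
    size_t n = load_samples_text(f, back, 3);
    printf("read back %zu samples\n", n);
    fclose(f);
    return 0;
}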



Re: [CentOS] Gulliver

2017-10-31 Thread Gordon Messmer

On 10/30/2017 10:07 AM, Chris Olson wrote:

All files are
loaded or moved from one machine to another with sftp.

The intern noticed right away that the documents will transfer
perfectly from our PPC and SPARC machines to our Intel/CentOS
platforms.  The raw data files, not so much.  There is always
an Endian (Thanks Gulliver) issue, which we assume is due to
the bytes of data being formatted into 32 bit words somewhere
in the Big Endian systems.



It's unlikely that copying the files is causing the problem you 
observe.  As Peter suggested, you can use "md5sum" on the source and 
destination hosts to demonstrate that the files are not being modified 
in transmission.


However, endianness can be a problem if the applications you use naively 
save data to a file in their native byte order and also read it back in 
native byte order.  In situations like that, a big-endian system will save 
data that the same application will fail to read when it is run on a 
little-endian system.


If this is an application that you've developed in-house, you should be 
using htonl() to convert your 32-bit values to network byte order before 
writing them to the data file, and ntohl() to convert the 32-bit values 
you read from data files back to the native host byte order.
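
A minimal sketch of that advice in C, assuming a simple record of
32-bit values: only htonl()/ntohl() from <arpa/inet.h> come from the
paragraph above, while the helper names and the "sample.dat" filename
are invented for illustration.

#include <arpa/inet.h>   /* htonl(), ntohl() */
#include <stdint.h>
#include <stdio.h>

/* Write one 32-bit value in network (big-endian) byte order. */
static int write_u32(FILE *f, uint32_t host_value)
{
    uint32_t wire = htonl(host_value);
    return fwrite(&wire, sizeof wire, 1, f) == 1 ? 0 : -1;
}

/* Read one 32-bit value back and convert it to host byte order. */
static int read_u32(FILE *f, uint32_t *host_value)
{
    uint32_t wire;

    if (fread(&wire, sizeof wire, 1, f) != 1)
        return -1;
    *host_value = ntohl(wire);
    return 0;
}

int main(void)
{
    uint32_t out = 123456, in = 0;
    FILE *f = fopen("sample.dat", "w+b");   /* illustrative filename */

    if (!f)
        return 1;
    write_u32(f, out);
    rewind(f);
    read_u32(f, &in);
    printf("wrote %u, read back %u\n", out, in);
    fclose(f);
    return 0;
}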




Re: [CentOS] Gulliver

2017-10-30 Thread Stephen John Smoogen
On 30 October 2017 at 13:07, Chris Olson wrote:
> We have been fortunate to hang onto one of our summer interns
> for part time work on weekends during the current school year.
> One of the intern's jobs is to load documents and data which
> are then processed.  The documents are .txt, .docx, and .pdf
> files. The data files are raw sensor outputs usually captured
> using ADCs mostly with eight bit precision.  All files are
> loaded or moved from one machine to another with sftp.
>
> The intern noticed right away that the documents will transfer
> perfectly from our PPC and SPARC machines to our Intel/CentOS
> platforms.  The raw data files, not so much.  There is always
> an Endian (Thanks Gulliver) issue, which we assume is due to
> the bytes of data being formatted into 32 bit words somewhere
> in the Big Endian systems.  It is not totally clear why the
> document files do not have this issue.  If there is a known
> principle behind these observations, we would appreciate very
> much any information that can be shared.
>
>

Text files that are ASCII are generally 7- to 8-bit characters, so they
don't tend to have endianness problems on 8-bit-or-wider architectures.
[I expect a 4-bit architecture would have problems.]  Multi-byte UTF
encodings can have endianness problems, but those usually come from not
following the standard and assuming that writing the data works the same
way it did with ASCII (mainly because few people ever dealt with 4-bit
computers).
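
A tiny, hypothetical C illustration of that point: a string is the same
sequence of bytes on every host, while the in-memory bytes of a 32-bit
word are not.

#include <stdio.h>
#include <stdint.h>
#include <string.h>

int main(void)
{
    const unsigned char text[] = "ABC";   /* one byte per character: identical on every host */
    uint32_t word = 0x41424344;           /* a 32-bit value */
    unsigned char bytes[sizeof word];

    memcpy(bytes, &word, sizeof word);    /* inspect how it sits in memory */

    printf("text: %02x %02x %02x\n", text[0], text[1], text[2]);
    printf("word: %02x %02x %02x %02x\n",
           bytes[0], bytes[1], bytes[2], bytes[3]);
    /* Big-endian hosts print 41 42 43 44; little-endian hosts print 44 43 42 41. */
    return 0;
}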

.docx and .pdf are written to a fixed-endianness format, so even if they
are built/written on a big-endian system, the data itself is laid out as
little-endian.  Raw data files, on the other hand, are usually
endian-dependent if they are 'raw' memory dumps or similar.  Some 'data'
formats that are mostly raw are actually written to a standard that works
everywhere, because both the little-endian and the big-endian side expect
the data to be written in 'big' or 'little' endian order and read it back
in as such.
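
A hedged sketch of that difference in C (the function names, filenames,
and sample values are invented for the example): fwrite() of 32-bit
words dumps whatever byte order the host happens to use, while writing
each value in a defined order produces a file that reads the same
everywhere.

#include <stdio.h>
#include <stdint.h>

/* "Raw memory dump": the byte order in the file is whatever the host CPU uses. */
static size_t dump_raw(FILE *f, const uint32_t *samples, size_t n)
{
    return fwrite(samples, sizeof samples[0], n, f);
}

/* Defined layout: always most-significant byte first, regardless of the host. */
static void write_be32(FILE *f, uint32_t v)
{
    unsigned char b[4] = {
        (unsigned char)(v >> 24),
        (unsigned char)(v >> 16),
        (unsigned char)(v >> 8),
        (unsigned char)(v)
    };

    fwrite(b, 1, sizeof b, f);
}

int main(void)
{
    uint32_t samples[] = { 0x11223344, 0xAABBCCDD };
    FILE *raw = fopen("native.dat", "wb");     /* contents differ between hosts */
    FILE *be  = fopen("portable.dat", "wb");   /* contents identical on every host */

    if (!raw || !be)
        return 1;
    dump_raw(raw, samples, 2);
    for (size_t i = 0; i < 2; i++)
        write_be32(be, samples[i]);
    fclose(raw);
    fclose(be);
    return 0;
}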






-- 
Stephen J Smoogen.


Re: [CentOS] Gulliver

2017-10-30 Thread Peter Kjellström
On Mon, 30 Oct 2017 17:07:31 + (UTC)
Chris Olson wrote:

> We have been fortunate to hang onto one of our summer interns
> for part time work on weekends during the current school year.
> One of the intern's jobs is to load documents and data which
> are then processed.  The documents are .txt, .docx, and .pdf
> files. The data files are raw sensor outputs usually captured
> using ADCs mostly with eight bit precision.  All files are
> loaded or moved from one machine to another with sftp.
> 
> The intern noticed right away that the documents will transfer
> perfectly from our PPC and SPARC machines to our Intel/CentOS
> platforms.  The raw data files, not so much.  There is always
> an Endian (Thanks Gulliver) issue, which we assume is due to
> the bytes of data being formatted into 32 bit words somewhere
> in the Big Endian systems.  It is not totally clear why the
> document files do not have this issue.  If there is a known
> principle behind these observations, we would appreciate very
> much any information that can be shared.

Transferring a file will not change anything. It will be bit-wise
identical.

However, the data in the file may be stored in little- or big-endian
byte order.  A file format may or may not have metadata indicating this.
That is, some files will read differently on different arches and
some will be immune (due to more sophisticated abstractions).

So it's not surprising that your raw files will have problems.

If you want to prove this to yourself, simply md5sum/sha1sum/etc. the
files on both sides.

/Peter K