On 6/27/2020 2:56 AM, Reinhard Kotucha wrote:
On 2020-06-20 at 10:25:33 +0200, Hans Hagen wrote:
> On 6/19/2020 11:16 PM, Reinhard Kotucha wrote:
> > Hi,
> > it's nice that with the fio library LuaTeX can now process binary
> > files. What I'm missing is the ability to specify the byte order
> > (little vs. big endian).
> >
> > I didn't find any hint, neither in the manual nor in the sources.
> >
> > Without being able to specify the byte order usage of the library is
> > quite limited. The byte order of a particular file is not necessarily
> > the same as that of your system.
> >
> > Some file formats have a certain byte order (PNM, for instance) and
> > others precede binary data with a byte order mark (TIFF). In any case
> > it's necessary to specify the byte order before reading binary stuff.
> >
> > Is there a chance to provide a switch?
>
> When I have time I'll backport a couple of the additional
> [integer|cardinal]*_le ones that we have in luametatex (I thought that
> i'd already done that).
Hi Hans,
I must admit that I don't know anything about luametatex. I just
looked into liolibext.c .
IMO there are a few things to consider.
The current code extracts single bytes from a file.
| static int readcardinal2(lua_State *L) {
| FILE *f = tofile(L);
| int a = getc(f);
| int b = getc(f);
|
This, and even the extraction of short strings, is extremely slow.
It's much more efficient to read data blocks of 8192 bytes, for
instance, into memory and to process these data blocks. I'm not
convinced that reading a complete file into memory is a good idea,
despite its simplicity.
that would add all kinds of overhead (buffer underrun, adapting to seek
etc. and therefore reload) and we can assume that the operating system
also buffers
Processing the content of a file with the fio library is then similar
to processing a string with the sio library, with the exception that
endianness has to be considered when files are involved.
it depends on what one does, sometimes a full load and using sio is
faster but that also has its overhead (pseudo seek)
as usual i did lots of (performance) tests and there is not that much to
gain on either end (several variants were played with)
The host byte order must always be determined automatically, either
with Luigi's approach or probably more easily with ntohs(3) if this
function is available on Windows too. The file byte order has to be
specified by the user because it depends on the file format.
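
For illustration, a minimal sketch of one common way to determine the host
byte order at run time. It is not necessarily the approach Luigi proposed,
and the names host_is_little_endian and swap16 are made up for this example;
nothing like them is claimed to exist in liolibext.c.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: returns 1 on a little-endian host, 0 otherwise.
   It looks at the first byte in memory of the 16-bit value 1. */
static int host_is_little_endian(void) {
    const uint16_t probe = 1;
    unsigned char first;
    memcpy(&first, &probe, 1);
    return first == 1;
}

/* Hypothetical helper: swaps the two bytes of a 16-bit value. This is
   the re-ordering needed when host and file byte orders differ. */
static uint16_t swap16(uint16_t v) {
    return (uint16_t)((v << 8) | (v >> 8));
}
```

Such a check only matters when raw multi-byte values are copied from a file
into memory. A reader that assembles the value arithmetically from single
bytes does not depend on the host byte order at all.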
the lib is meant for usage in known scenarios (known, documented file
formats), not arbitrary, depending on architecture or implementation
(btw, the format file used to normalize to big endian but that was
dropped long ago already: formats are no longer portable, which in fact
was already dropped before that)
If a particular file format has a BOM in its header, the BOM can be
evaluated by the user, for instance with fio.readline(). This means
that a user should be able to specify the endianness at any time, not
necessarily in advance.
sure but a few extra readers would solve that
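
A rough sketch of what such an extra reader could look like next to a
big-endian one. The names read_cardinal2_be and read_cardinal2_le and the -1
return value at end of file are assumptions made for this example; they are
not the actual functions in liolibext.c or luametatex.

```c
#include <stdio.h>

/* Hypothetical reader: 2-byte unsigned value, most significant byte
   first. Returns -1 if the file ends before two bytes were read. */
static long read_cardinal2_be(FILE *f) {
    int a = getc(f);
    int b = getc(f);
    if (a == EOF || b == EOF)
        return -1;
    return 0x100L * a + b;
}

/* Hypothetical reader: same as above, least significant byte first. */
static long read_cardinal2_le(FILE *f) {
    int a = getc(f);
    int b = getc(f);
    if (a == EOF || b == EOF)
        return -1;
    return 0x100L * b + a;
}
```

Because the value is built arithmetically from individual bytes, the result
is the same on little-endian and big-endian hosts. The caller picks the
reader that matches the file format, so no flag has to be checked per call.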
As far as I understand it's sufficient that the relevant functions
read{cardinal,integer}{2,4} obey a flag which tells them whether byte
re-ordering is necessary. The flag has to be set if host and file
byte orders are different. I don't know whether we have to consider
64 bit integers too.
that adds passing parameters and checking them for each call ... you can
then as well use lua's 'read' function and convert with string.byte/char
which is then about equally fast
If you intend to go this way the number of functions in liolibext.c
can be halved because there is no significant difference between a
buffer and a string. Only very few functions have to be aware of
endianness.
halved in calls to simple functions, enlarged by more checking ... more
pain than gain
There is one difference though. A string is always complete while a
buffer contains only a part of a file. If there are not enough
bytes at the end of a buffer in order to fulfill a request, the
missing bytes can be loaded from the file and appended to the buffer.
This has no significant impact on speed because it happens quite
rarely. It's similar to the example in PIL, chapter 'The complete I/O
Model', section 'A small performance trick'.
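
The refill step described above could be sketched as follows. This is only
an illustration under assumptions: block_reader, br_init, br_need and
br_cardinal2 are hypothetical names, the block size of 8192 is the figure
mentioned earlier in this mail, and seeking is not handled at all.

```c
#include <stdio.h>
#include <string.h>

#define BLOCK_SIZE 8192

/* Hypothetical buffered reader: a window of at most BLOCK_SIZE bytes. */
typedef struct {
    FILE *f;
    unsigned char data[BLOCK_SIZE];
    size_t pos;   /* index of the next unread byte */
    size_t len;   /* number of valid bytes in data */
} block_reader;

static void br_init(block_reader *r, FILE *f) {
    r->f = f;
    r->pos = 0;
    r->len = 0;
}

/* Make sure at least n unread bytes are buffered. If not, move the
   unread tail to the front and append fresh data from the file.
   Returns 1 on success, 0 if the file ends first or n is too large. */
static int br_need(block_reader *r, size_t n) {
    size_t avail = r->len - r->pos;
    if (avail >= n)
        return 1;
    if (n > BLOCK_SIZE)
        return 0;
    memmove(r->data, r->data + r->pos, avail);
    r->pos = 0;
    r->len = avail + fread(r->data + avail, 1, BLOCK_SIZE - avail, r->f);
    return r->len >= n;
}

/* Big-endian 2-byte unsigned value from the buffer, -1 at end of file. */
static long br_cardinal2(block_reader *r) {
    long v;
    if (!br_need(r, 2))
        return -1;
    v = 0x100L * r->data[r->pos] + r->data[r->pos + 1];
    r->pos += 2;
    return v;
}
```

The refill branch in br_need only runs when a request straddles the end of
the buffered block, which is the rare case mentioned above.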
If the user doesn't specify a byte order we can assume host byte
order. I can't imagine any reasonable use case right now, except
if a temporary file is read by the same process that created it.
as we have lua 5.3 you can consider using the string.unpack function
Hans
-----------------------------------------------------------------
Hans Hagen | PRAGMA ADE
Ridderstraat 27 | 8061 GH Hasselt | The Netherlands
tel: 038 477 53 69 | www.pragma-ade.nl | www.pragma-pod.nl
-----------------------------------------------------------------