On 2012-11-07 at 13:08:33 +0100, Hans Hagen wrote:

 > Hi Reinhard,
 > 
 > At my end, this works best:
 > 
 > function io.readall(f)
 >      local size = f:seek("end")
 >      if size == 0 then
 >          return ""
 >      elseif size < 1024*1024 then
 >          -- small file: read it in one go
 >          f:seek("set",0)
 >          return f:read('*all')
 >      else
 >          f:seek("set",0)
 >          -- chunk size grows with the file, clamped to 1M .. 16M
 >          local step
 >          if size < 8*1024*1024 then
 >              step = 1024*1024
 >          elseif size > 16*1024*1024 then
 >              step = 16*1024*1024
 >          else
 >              step = math.floor(size/(1024*1024)) * 1024*1024 / 8
 >          end
 >          local data = { }
 >          while true do
 >              local r = f:read(step)
 >              if not r then
 >                  return table.concat(data)
 >              else
 >                  data[#data+1] = r
 >              end
 >          end
 >      end
 > end
 > 
 > usage:
 > 
 > local f = io.open(name)
 > if f then
 >    data = io.readall(f)
 >    f:close()
 > end
 > 
 > up to 50% faster and often less memory usage

Thank you, Hans.  Here it's faster than reading the file at once but
still slower than reading 8k blocks.  It also consumes as much memory
as reading the file at once (and peak memory consumption grows
disproportionately with file size), but I could reduce memory
consumption significantly by replacing

  return table.concat(data)

with

  return data

table.concat() keeps the file in memory twice, once as a table of
chunks and once as the resulting string.
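
A minimal sketch of that variant (the name io.readall_chunks is my
own, not from any library): returning the chunk table lets the caller
stream the data out again without ever building the whole file as one
string:

------------------------------------------------
function io.readall_chunks (f)
  local step = 1024*1024          -- fixed 1M chunks, for simplicity
  local data = { }
  while true do
    local r = f:read(step)
    if not r then
      return data                 -- no table.concat(), no second copy
    end
    data[#data+1] = r
  end
end

-- example: copy a file without ever holding it as a single string
local fin  = assert(io.open('input.dat', 'rb'))
local fout = assert(io.open('output.dat', 'wb'))
for _, chunk in ipairs(io.readall_chunks(fin)) do
  fout:write(chunk)
end
fin:close()
fout:close()
------------------------------------------------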

 > btw, speed is not so much an issue (network speed, disk speed and
 > os caching play a role too, and manipulating such large amounts of
 > data often takes way more processing time) but the lower memory
 > consumption as a side effect is nice

Yes, memory consumption is a problem on my machine at work.  I'm
running Linux in a virtual machine under 32-bit Windows.  Windows can
only use 3GB of memory and needs 800MB for itself.  Though I can
assign more than 3GB to the VM, I suppose that I actually get less
than 2.2GB of real memory and that the rest is backed by a swap file.
Furthermore, multi-tasking/multi-user systems can only work if no
program assumes that it's the only one running.

Speed is important in many cases.  And I think that if you're writing
a function you want to use in various scripts, it's worthwhile to
evaluate the parameters carefully.

The idea I had was to write a function which allows one to read a
text file efficiently.  It should also be flexible and easy to use.

In Lua it's convenient to read a file either line-by-line or all at
once.  Neither is efficient: the former is extremely slow when lines
are short, and the latter consumes a lot of memory.  And in many
cases you don't even need the content of the whole file.
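
For reference, the two standard idioms are:

  -- line-by-line: one read per line, slow when lines are short
  for line in io.lines('testfile') do
    print(line)
  end

  -- all at once: the whole file becomes one Lua string in memory
  local fh = assert(io.open('testfile', 'r'))
  local data = fh:read('*all')
  fh:close()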

What I have so far is a function which reads a block and [the rest
of] a line within an endless loop.  Each chunk is split into lines.
readfile() takes two arguments, the file name and a function.  That
function is run on each line of each chunk.  Thus I'm able to filter
the data, and not everything has to be stored in memory.

------------------------------------------------
#! /usr/bin/env texlua
--*- Lua -*-

function readfile (filename, fun)
  local lineno = 1
  local fh = assert(io.open(filename, 'r'))
  while true do
    -- read an 8k block plus the rest of the line it ends in, so
    -- that each chunk ends exactly at a line boundary
    local line, rest = fh:read(2^13, '*line')
    if not line then break end
    if rest then line = line..rest end
    -- string.explode() is a LuaTeX extension to the string library
    local tab = line:explode('\n')
    for i, v in ipairs(tab) do
      fun(v, lineno)
      lineno = lineno + 1
    end
  end
  fh:close()
end

function process_line (line, n)
  print(n, line)
end

readfile ('testfile', process_line)

------------------------------------------------

Memory consumption is either 8kB or the length of the longest line,
whichever is larger, unless you store lines in a string or table.
Almost no extra memory is needed if you manipulate each line somehow
and write the result to another file.  The only files I encountered
which are really large are CSV-like files which contain rows and
columns of numbers, but the function process_line() allows me to
select only the rows and columns I want to pass to pgfplots, for
example.
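
A minimal sketch of such a filter (the output file name and the
column layout are made up for illustration; it assumes
whitespace-separated rows with at least three columns):

------------------------------------------------
-- keep every 10th data row, columns 1 and 3 only
function process_line (line, n)
  if n == 1 or n % 10 ~= 0 then return end  -- skip header, most rows
  local cols = line:explode(' +')  -- LuaTeX: split on runs of spaces
  out:write(cols[1], '\t', cols[3], '\n')
end

out = assert(io.open('plot.dat', 'w'))
readfile('testfile', process_line)
out:close()
------------------------------------------------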

 > at my end 2^24 is the most efficient (in time) block size

I found out that 2^13 is the most efficient here.  But I suppose that
the most important thing is that it's an integer multiple of a
filesystem data block.  Since Taco provided os.type and os.name, it's
possible to make the chunk size system dependent.  But I fear that
the actual hardware (SSD vs. magnetic disk) has a bigger impact than
the OS.
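
A sketch of such a system-dependent choice (the sizes are
placeholders, not measured values; note that in LuaTeX os.type and
os.name are strings, not functions):

  local chunksize
  if os.type == 'windows' then
    chunksize = 2^16              -- placeholder, needs benchmarking
  elseif os.name == 'linux' then
    chunksize = 2^13
  else
    chunksize = 2^13              -- conservative default
  end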

Regards,
  Reinhard

-- 
----------------------------------------------------------------------------
Reinhard Kotucha                                      Phone: +49-511-3373112
Marschnerstr. 25
D-30167 Hannover                              mailto:reinhard.kotu...@web.de
----------------------------------------------------------------------------
Microsoft isn't the answer. Microsoft is the question, and the answer is NO.
----------------------------------------------------------------------------