I've been working with big (multi-GB) text files and recently put
together the following, which seems to work OK.
freadLFblk is a slightly modified version of freadblock (it allows the
block size to be specified). freadLFblks then essentially loops through
the blocks. There are probably a few wrinkles to be ironed out yet, so
if you find any, please share!
Note 'Example use'
NB. the following reads the input file a chunk at a time, reformatting and
NB. writing each chunk as it goes
'myoutputfile.txt' (fappends~ makeMap@parseMap) freadLFblks 'myinputfile.txt';3e6
NB. where parseMap & makeMap parse and reformat a literal chunk of the
NB. input file into the desired format
)
NB.*freadLFblk a Read block of LF-terminated lines from file
NB. m is: blocksize (e.g. ~1e6)
NB. y is: filename;start position
NB. returns: block;new start position
NB. eg: 1e6 freadLFblk 'myfilename';0
freadLFblk=: 1 : 0
'fn strt'=. y
fn=. > fboxname fn
sz=. 1!:4 <fn
NB. if. sz = _1 do. return. end.
if. (sz = 0) +. strt >: sz do. '';strt return. end.
if. m < sz-strt do.
dat=. 1!:11 fn;strt,m
len=. 1 + dat i: LF
if. len > #dat do.
'file not in LF-delimited lines' 13!:8[3
end.
strt=. strt + len
dat=. len {. dat
else.
dat=. 1!:11 fn;strt,sz-strt
dat=. dat, LF -. {: dat
strt=. sz
end.
dat;strt
)
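
For quick experimentation, freadLFblk can also be driven by hand. A
minimal sketch, assuming 'myinputfile.txt' from the example above exists
and consists of LF-terminated lines:

```j
NB. read successive ~3e6-byte chunks, each trimmed to end on an LF boundary
pos=. 0
'chunk pos'=. 3e6 freadLFblk 'myinputfile.txt';pos  NB. first chunk; pos advances past it
'chunk pos'=. 3e6 freadLFblk 'myinputfile.txt';pos  NB. next chunk from the new position
```

Each chunk ends exactly on a line boundary, so line-oriented verbs can be
applied to it safely.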
NB.*freadLFblks a Reads LF-delimited file in chunks
NB. u is: dyadic verb that determines how to process the chunks
NB. y is: literal or 2-item boxed list
NB. 0{::y is filename to read
NB. 1{::y is optional integer specifying size of chunks in bytes
NB. x is: Optional parameters passed to the chunk-processing verb
NB. Verb (u) that processes chunks takes chunk as right argument
NB. and any additional args as the left argument. If none are
NB. required then the verb should accept but ignore the left arg
freadLFblks=: 1 : 0
'' u freadLFblks y
:
'fn blksz'=. 2 {. (boxopen y),< 6e6
sz=. fsize fn
strt=. 0
assert. sz > strt
'chunkdat strt'=. blksz freadLFblk fn;strt NB. read first chunk of file
res=. x u chunkdat
while. sz > strt do.
'chunkdat strt'=. blksz freadLFblk fn;strt NB. read remaining chunks of file
res=. res, x u chunkdat
end.
res
)
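
As a concrete use of the adverb, here is a sketch that counts the lines
of a file in 1e6-byte chunks (countLF is a name invented for this
example; note the capped fork, since freadLFblks always supplies a left
argument that this verb must accept but ignore):

```j
NB. dyad: ignore left argument, count LF characters in the chunk
countLF=: [: +/ LF = ]
counts=. countLF freadLFblks 'myinputfile.txt';1e6  NB. one count per chunk
+/ counts                                           NB. total lines in the file
```

Because each chunk ends on an LF boundary, per-chunk counts sum to the
file total with no lines split across chunks.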
On Fri, Oct 11, 2013 at 8:23 AM, Raul Miller <[email protected]> wrote:
> When the size of your data exceeds some significant fraction of
> available memory, it's probably worth using a loop.
>
> In other words: first develop your code so it works on a smaller data
> set, then pick some suitably large block size (1GB?) and loop over
> however many blocks you need.
>
> Loops are more complicated and they do have some overhead, but in some
> situations those are trivial costs.
>
> Thanks,
>
> --
> Raul
>
>
> On Thu, Oct 10, 2013 at 2:27 PM, Pascal Jasmin <[email protected]>
> wrote:
> > +/ (3 :'(2{"1 (< ;._1 TAB -.~ y))=<''ABC''')"1 mf
> >
> >
> > match '-:' might be faster than =, but overall just a tacit version:
> >
> > +/ (<'ABC') = 2&{@:(<;._1 TAB -.~ ])"1 mf
> >
> > untested
> >
> >
> > ----- Original Message -----
> > From: Joe Bogner <[email protected]>
> > To: [email protected]
> > Cc:
> > Sent: Thursday, October 10, 2013 2:02:07 PM
> > Subject: [Jprogramming] memory mapped tab delimited file
> >
> > I have a 5 gig, 9 million row tab delimited file that I'm working with.
> >
> > I started with a subset of 300k records and used fapplylines. It took
> > about 5 seconds. I shaved 2 seconds off by modifying fapplylines to use
> > memory mapped files
> > mapped files
> >
> > I then applied it to my larger file and found that it was taking about
> > 220 seconds. Not bad, but I wanted to push for something faster.
> >
> > Using a memory mapped file was simple enough. I wrote a routine to add a
> > column and pad it to the longest column (600 characters).
> >
> > $ mf
> > 9667548 602
> >
> > I'd like to keep it in a tab delimited file if possible because I'm using
> > that file for other purposes.
> >
> > The file is so large that I don't think I'll be able to cut it up ahead
> > of time into an inverted table or otherwise (but maybe?), so I'm
> > effectively looping through
> >
> > I've played with different variants and came up with the following
> > statement to count the number of rows that have column 2 = ABC
> >
> > +/ (3 :'(2{"1 (< ;._1 TAB -.~ y))=<''ABC''')"1 mf
> >
> > This gives the correct result and takes about 102 seconds and only uses
> > about 2 gig of memory while running and settles back down to 500mb
> >
> > I picked off some of the syntax _1 TAB -.~ from other posts.
> >
> > Are there any ideas on how to make it go faster, or am I up against a
> > hardware limit? By the way, I'm impressed with this speed as is. It
> > takes about 348 seconds to read into R using the heavily optimized
> > data.table fread package, which also uses memory mapped files. The
> > standard import is more than a few hours. I can go from start to finish
> > in J in under 102 seconds.
> > ----------------------------------------------------------------------
> > For information about J forums see http://www.jsoftware.com/forums.htm