I've used this code on large files: it's an adverb that applies an
arbitrary verb to sequential blocks of a file. The example usage

   CTR [ ((10{a.)&(4 : 'CTR=: CTR + x +/ . = >0{y')) doSomething ^:_ ] 0;1e6;(fsize 'bigFile.txt');'bigFile.txt' [ CTR=: 0

accumulates the number of lines in a file in the global "CTR" by counting
the line-feed (10{a.) characters in each block.

For the files on which I'm working, I assume the first line is a header, so
I pull this off the first time through and pass it along to subsequent
invocations.

NB.* workOnLargeFile.ijs: apply arbitrary verb across large file in blocks.

NB.* doSomething: do something to a large file in sequential blocks.
doSomething=: 1 : 0
   'curptr chsz max flnm leftover hdr'=. 6{.y
   if. curptr>:max do. ch=. curptr;chsz;max;flnm
   else. if. 0=curptr do. ch=. readChunk curptr;chsz;max;flnm
           chunk=. leftover,CR-.~>_1{ch   NB. Prepend prior leftover; discard CRs.
           'chunk leftover'=. (>:chunk i: LF) split chunk  NB. Work up to last complete line; retain trailing partial line as "leftover".
           'hdr body'=. (>:chunk i. LF) split chunk        NB. Assume 1st line is header.
           hdr=. }:hdr                                     NB. Drop trailing LF from header.
       else. chunk=. leftover,CR-.~>_1{ch=. readChunk curptr;chsz;max;flnm
           'body leftover'=. (>:chunk i: LF) split chunk
       end.
       u body;<hdr
   end.
   (4{.ch),leftover;<hdr
NB.EG CTR [ ((10{a.)&(4 : 'CTR=: CTR + x +/ . = >0{y')) doSomething ^:_ ] 0;1e6;(fsize 'bigFile.txt');'bigFile.txt' [ CTR=: 0
)

readChunk=: 3 : 0
   'curptr chsz max flnm'=. 4{.y
   if. 0<chsz2=. chsz<.0>.max-curptr do. chunk=. fread flnm;curptr,chsz2
   else. chunk=. '' end.
   (curptr+chsz2);chsz2;max;flnm;chunk
NB.EG chunk=. >_1{ch0=. readChunk 0;1e6;(fsize 'bigFile.txt');'bigFile.txt'
)
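For readers more at home outside J, the same buffered-block technique (fixed-size reads, CRs stripped, the trailing partial line carried forward as "leftover", the header peeled off the first block) might be sketched in Python roughly as below. The file name, callback, and chunk size are illustrative only, and the sketch assumes the chunk size is at least one full line:

```python
import os
import tempfile

def do_something(path, chunk_size, handle_block):
    """Apply handle_block(body, header) to sequential blocks of a file,
    mirroring the J adverb: fixed-size reads, CRs discarded, the trailing
    partial line carried forward as 'leftover', header peeled off once."""
    leftover = b""
    header = None
    with open(path, "rb") as f:
        while True:
            data = f.read(chunk_size)
            if not data:
                break
            chunk = leftover + data.replace(b"\r", b"")
            cut = chunk.rfind(b"\n") + 1          # work up to last complete line
            chunk, leftover = chunk[:cut], chunk[cut:]
            if header is None:                    # assume 1st line is header
                nl = chunk.find(b"\n")            # assumes chunk_size > header length
                header, chunk = chunk[:nl], chunk[nl + 1:]
            handle_block(chunk, header)
    return header, leftover

# Count data lines, analogous to accumulating CTR:
total = {"n": 0}
def count_block(body, header):
    total["n"] += body.count(b"\n")

with tempfile.NamedTemporaryFile(delete=False) as tf:
    tf.write(b"col1\tcol2\n" + b"foo\tABC\n" * 5)   # small stand-in for bigFile.txt
hdr, left = do_something(tf.name, 16, count_block)
os.unlink(tf.name)
print(total["n"])  # → 5
```

As in the J version, a file that does not end with LF leaves its final partial line in the returned leftover rather than passing it to the callback.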



On Fri, Oct 11, 2013 at 8:03 AM, Raul Miller <[email protected]> wrote:

> If you can make a memory mapped file approach work for you, I expect
> that it would be a lot faster than a buffered block approach, at least
> on current machines.
>
> Thanks,
>
> --
> Raul
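In Python terms, Raul's memory-mapped approach could look something like the following: `mmap` exposes the file as one flat buffer that the OS pages in on demand, so no explicit buffered-block loop is needed. The file here is a small hypothetical stand-in, not the real bigFile.txt:

```python
import mmap
import os
import tempfile

# Write a small stand-in file so the sketch is self-contained.
with tempfile.NamedTemporaryFile(delete=False) as tf:
    tf.write(b"col1\tcol2\n" + b"foo\tABC\n" * 9)

with open(tf.name, "rb") as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        # Scan the mapped buffer for line feeds; the OS handles paging.
        nlines = 0
        pos = mm.find(b"\n")
        while pos != -1:
            nlines += 1
            pos = mm.find(b"\n", pos + 1)
os.unlink(tf.name)
print(nlines)  # → 10
```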
>
> On Fri, Oct 11, 2013 at 7:46 AM, Joe Bogner <[email protected]> wrote:
> > Ganesh - Thank you. That is very fast. It runs in 4 seconds on my machine
> > with a 9 million row sample string.
> >
> > st =. 1 0 , 0 6 ,: 0 0
> > st =. st ,: 0 6 , 0 6 ,: 2 1
> > st =. st , 3 3 , 0 6 ,: 2 0
> > st =. st , 3 0 , 0 0 ,: 3 0
> > tab =. 9 { a.
> > secondcolumn =. (0;st;(< tab;LF))&;: NB. Boxed list of the second column
> >
> > lines=.('foo',TAB,'ABC',TAB,LF,'foo',TAB,'Q',TAB,LF)
> > mf=. , L:0 (9e6, # lines) $ lines
> > (6!:2) 'c=:(<''ABC'') +/@:= secondcolumn mf'
> > c
> >
> >
> > On my real data it takes 49 seconds -- the string is considerably longer:
> > I estimate 3.6 billion bytes vs 144 million. It doesn't produce the
> > correct result because it doesn't hold to the assumptions, however it's
> > great to see what is possible with more work.
> >
> >
> >
> > Ric - I haven't tested what you provided yet. Conceptually, is there any
> > reason to believe it'd be faster than working with a fixed-width
> > memory-mapped file? The line terminations are pre-determined at that
> > point, so it's just a matter of looping over the indices and reformatting
> > each line.
> >
> >
> > I also don't quite understand the difference between fapplylines and
> > freadblock (http://www.jsoftware.com/jwiki/RicSherlock/FileProcessing).
> > It sounds like freadblock discards the partial lines and returns the new
> > index, while fapplylines retains the partial line. It sounds like
> > freadblock is more like what Raul is describing. I'm fuzzy because both
> > process the data in chunks. Thanks again.
> >
> >
> > On Thu, Oct 10, 2013 at 8:26 PM, Ganesh Rapolu <[email protected]>
> wrote:
> >
> >> Oops. The previous code gives the first column. This is the corrected
> >> code, with the same assumptions as before but handling more than 3
> >> columns.
> >>
> >> st =. ,: 1 0 , 0 6 ,: 0 0
> >> st =. st , 2 0 , 0 6 ,: 1 0
> >> st =. st , 0 6 , 0 6 ,: 3 1
> >> st =. st , 4 3 , 0 6 ,: 3 0
> >> st =. st , 4 0 , 0 0 ,: 4 0
> >> tab =. 9 { a. NB. safer for email
> >> secondcolumn =. (0;st;(< tab;LF))&;:  NB. Boxed list of the second column
> >>
> >> (<'ABC') +/@:= secondcolumn mf
> >>
> >>
> >>
> >> On Thu, Oct 10, 2013 at 2:14 PM, Ganesh Rapolu <[email protected]>
> >> wrote:
> >>
> >> > Assumptions:
> >> > - no double tabs
> >> > - no blank lines
> >> > - no blank columns
> >> > - no line starts with a tab
> >> > - no column itself contains a tab
> >> > - no CR
> >> > - more than 2 columns
> >> > - file is whole (rank 1 and not split by LF)
> >> >
> >> >
> >> > st =. 1 0 , 0 6 ,: 0 0
> >> > st =. st ,: 0 6 , 0 6 ,: 2 1
> >> > st =. st , 3 3 , 0 6 ,: 2 0
> >> > st =. st , 3 0 , 0 0 ,: 3 0
> >> > tab =. 9 { a.
> >> > secondcolumn =. (0;st;(< tab;LF))&;:  NB. Boxed list of the second column
> >> >
> >> > (<'ABC') +/@:= secondcolumn mf
> >> >
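Ganesh's `;:` sequential machine extracts every second tab-field in a single pass over the whole buffer. Under his stated assumptions (no double tabs, no CRs, every line LF-terminated), the effect can be sketched in Python with string splitting standing in for the state machine; the sample data is hypothetical, and this is a semantic sketch, not a performance analogue of `;:`:

```python
data = b"foo\tABC\t\nfoo\tQ\t\nbar\tABC\t\n"  # stand-in for the mapped buffer mf

# One pass over the whole buffer: split into lines, take the second tab-field
# of each (the column the J state table extracts).
second_column = [line.split(b"\t")[1] for line in data.split(b"\n") if line]

count = sum(f == b"ABC" for f in second_column)
print(count)  # → 2
```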
> >> >> > > ----- Original Message -----
> >> >> > > From: Joe Bogner <[email protected]>
> >> >> > > To: [email protected]
> >> >> > > Cc:
> >> >> > > Sent: Thursday, October 10, 2013 2:02:07 PM
> >> >> > > Subject: [Jprogramming] memory mapped tab delimited file
> >> >> > >
> >> >> > > I have a 5 gig, 9 million row tab delimited file that I'm
> >> >> > > working with.
> >> >> > >
> >> >> > > I started with a subset of 300k records and used fapplylines. It
> >> >> > > took about 5 seconds. I shaved 2 seconds off by modifying
> >> >> > > fapplylines to use memory mapped files.
> >> >> > >
> >> >> > > I then applied it to my larger file and found that it was taking
> >> >> > > about 220 seconds. Not bad, but I wanted to push for something
> >> >> > > faster.
> >> >> > >
> >> >> > > Using a memory mapped file was simple enough. I wrote a routine
> >> >> > > to add a column and pad it to the longest column (600 characters).
> >> >> > >
> >> >> > > $ mf
> >> >> > > 9667548 602
> >> >> > >
> >> >> > > I'd like to keep it in a tab delimited file if possible because
> >> >> > > I'm using that file for other purposes.
> >> >> > >
> >> >> > > The file is so large that I don't think I'll be able to cut it up
> >> >> > > ahead of time into an inverted table or otherwise (but maybe?), so
> >> >> > > I'm effectively looping through it.
> >> >> > >
> >> >> > > I've played with different variants and came up with the
> >> >> > > following statement to count the number of rows that have
> >> >> > > column 2 = ABC:
> >> >> > >
> >> >> > > +/ (3 :'(2{"1 (< ;._1 TAB -.~ y))=<''ABC''')"1  mf
> >> >> > >
> >> >> > > This gives the correct result, takes about 102 seconds, and only
> >> >> > > uses about 2 gig of memory while running, settling back down to
> >> >> > > 500mb afterwards.
> >> >> > >
> >> >> > > I picked up some of the syntax (;._1 TAB -.~) from other posts.
> >> >> > >
> >> >> > > Are there any ideas on how to make it go faster, or am I up
> >> >> > > against a hardware limit? By the way, I'm impressed with this
> >> >> > > speed as is. It takes about 348 seconds to read into R using the
> >> >> > > heavily optimized data.table fread package, which also uses
> >> >> > > memory mapped files. The standard import takes more than a few
> >> >> > > hours. I can go from start to finish in J in under 102 seconds.
> >> >> > >
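Joe's expression applies a verb to each padded row: cut the row on TAB, box the fields, and compare the second column against 'ABC'. A Python sketch of the same row-at-a-time approach over a fixed-width character matrix (sample rows and width are hypothetical) could be:

```python
# Each row padded to a fixed width, as in the 9667548-by-602 mapped array mf.
width = 20
rows = [b"foo\tABC\t".ljust(width),
        b"foo\tQ\t".ljust(width),
        b"bar\tABC\t".ljust(width)]

def second_field_is_abc(row):
    # Analogue of the J verb applied at rank 1: strip the padding,
    # cut the row on TAB, and test the second column.
    fields = row.rstrip().split(b"\t")
    return fields[1] == b"ABC"

count = sum(second_field_is_abc(r) for r in rows)
print(count)  # → 2
```

Applying a function per row like this is exactly the overhead Ganesh's whole-buffer state machine avoids, which is consistent with the timing gap Joe reports.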
> >> >> > > ----------------------------------------------------------------------
> >> >> > > For information about J forums see http://www.jsoftware.com/forums.htm
>



-- 
Devon McCormick, CFA