Assumptions:
- no doubled tabs
- no blank lines
- no blank columns
- no line starts with a tab
- no column itself contains a tab
- no CR characters (LF line endings only)
- more than 2 columns
- the file is whole (rank 1, not yet split on LF)

st =. 1 0 , 0 6 ,: 0 0                NB. State table for the sequential machine (dyadic ;:):
st =. st ,: 0 6 , 0 6 ,: 2 1          NB. one 3x2 plane per state; rows index the character
st =. st , 3 3 , 0 6 ,: 2 0           NB. classes given by the tab;LF mapping below, and each
st =. st , 3 0 , 0 0 ,: 3 0           NB. entry is (next state, action code)
tab =. 9 { a.                         NB. the TAB character
secondcolumn =. (0;st;(< tab;LF))&;:  NB. Boxed list of the second column

(<'ABC') +/@:= secondcolumn mf   NB. count rows whose second column is 'ABC'
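For comparison outside J, the same count can be sketched over a memory-mapped file of fixed-width, space-padded rows, like the 9667548-by-602 mf array in the quoted message. A minimal Python version using the stdlib mmap module (the path, row width, and padding convention are assumptions, not taken from the original post):

```python
import mmap

def count_matches(path, row_len, target=b"ABC", col=1):
    """Count rows whose col-th tab-delimited field equals target.

    Assumes the file is a matrix of fixed-width rows, row_len bytes
    each, space-padded on the right; path and row_len are hypothetical
    parameters supplied by the caller.
    """
    count = 0
    with open(path, "rb") as f, \
         mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
        for off in range(0, len(m), row_len):
            # strip the right padding, then split the row on tabs
            fields = m[off:off + row_len].rstrip(b" \n").split(b"\t")
            if len(fields) > col and fields[col] == target:
                count += 1
    return count
```

This loops row by row rather than vectorizing, so it illustrates the logic, not the speed, of the J one-liner above.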

> > ----- Original Message -----
> > > From: Joe Bogner <[email protected]>
> > > To: [email protected]
> > > Cc:
> > > Sent: Thursday, October 10, 2013 2:02:07 PM
> > > Subject: [Jprogramming] memory mapped tab delimited file
> > >
> > > I have a 5 gig, 9 million row tab delimited file that I'm working with.
> > >
> > > I started with a subset of 300k records and used fapplylines. It took
> > > about 5 seconds. I shaved 2 seconds off by modifying fapplylines to
> > > use memory mapped files.
> > >
> > > I then applied it to my larger file and found that it was taking
> > > about 220 seconds. Not bad, but I wanted to push for something faster.
> > >
> > > Using a memory mapped file was simple enough. I wrote a routine to
> > > add a column and pad it to the longest column (600 characters).
> > >
> > > $ mf
> > > 9667548 602
> > >
> > > I'd like to keep it in a tab delimited file if possible because I'm
> > > using that file for other purposes.
> > >
> > > The file is so large that I don't think I'll be able to cut it up
> > > ahead of time into an inverted table or otherwise (but maybe?), so
> > > I'm effectively looping through it.
> > >
> > > I've played with different variants and came up with the following
> > > statement to count the number of rows that have column 2 = ABC
> > >
> > > +/ (3 :'(2{"1 (< ;._1 TAB -.~ y))=<''ABC''')"1  mf
> > >
> > > This gives the correct result, takes about 102 seconds, and uses only
> > > about 2 GB of memory while running, settling back down to 500 MB.
> > >
> > > I picked up some of the syntax (;._1 TAB -.~) from other posts.
> > >
> > > Are there any ideas on how to make it go faster, or am I up against
> > > a hardware limit? By the way, I'm impressed with this speed as is. It
> > > takes about 348 seconds to read into R using the heavily optimized
> > > data.table fread package, which also uses memory mapped files. The
> > > standard import takes more than a few hours. I can go from start to
> > > finish in J in under 102 seconds.
> > > ----------------------------------------------------------------------
> > > For information about J forums see http://www.jsoftware.com/forums.htm