Ganesh - Thank you. That is very fast. It runs in 4 seconds on my machine
with a 9 million row sample string.

st =. 1 0 , 0 6 ,: 0 0
st =. st ,: 0 6 , 0 6 ,: 2 1
st =. st , 3 3 , 0 6 ,: 2 0
st =. st , 3 0 , 0 0 ,: 3 0
tab =. 9 { a.
secondcolumn =. (0;st;(< tab;LF))&;: NB. Boxed list of the second column

lines =. ('foo',TAB,'ABC',TAB,LF,'foo',TAB,'Q',TAB,LF)
mf =. , L:0 (9e6, # lines) $ lines
(6!:2) 'c=:(<''ABC'') +/@:= secondcolumn mf'
c
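For readers who don't follow the `;:` sequential machine, here is a rough Python analogue of the computation it performs (split each line on tabs, take the second column, count the 'ABC' matches). This is only a conceptual sketch of the result, not of the state-machine mechanism, and the tiny repetition count stands in for the 9e6 used in the timing test:

```python
# Python sketch of what the J expression computes: split each line on
# tabs, take the second column, count how many equal 'ABC'. The sample
# mirrors the two-line `lines` string used in the timing test above.
TAB, LF = "\t", "\n"
lines = "foo" + TAB + "ABC" + TAB + LF + "foo" + TAB + "Q" + TAB + LF
mf = lines * 3  # small repetition instead of the 9e6 used for timing

count = sum(ln.split(TAB)[1] == "ABC" for ln in mf.splitlines())
print(count)  # 3: one 'ABC' row per copy of the two-line sample
```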


On my real data it takes 49 seconds -- the string is considerably longer: I
estimate 3.6 billion bytes vs. 144 million. It doesn't produce the correct
result because my data doesn't hold to the assumptions, but it's great to
see what is possible with more work.



Ric - I haven't tested what you provided yet. Conceptually, is there any
reason to believe it'd be faster than working with a fixed-width
memory-mapped file? The line terminations are pre-determined at that point,
so it's just a matter of looping over the indices and reformatting each line.
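For concreteness, the fixed-width memory-mapped scan I have in mind looks roughly like this in Python (a sketch only: the record width, field layout, and test data here are made up for illustration; the real file is 9667548 x 602):

```python
import mmap
import os
import tempfile

# Sketch of scanning a fixed-width memory-mapped file: each record is
# WIDTH bytes, fields are tab-separated within the record, and rows are
# space-padded out to the fixed width. WIDTH is an illustrative value.
WIDTH = 32

def count_matches(path, target=b"ABC", width=WIDTH):
    """Count fixed-width records whose second tab field equals target."""
    with open(path, "rb") as f, \
            mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        hits = 0
        for i in range(len(mm) // width):
            rec = mm[i * width:(i + 1) * width]
            fields = rec.rstrip(b" ").split(b"\t")
            if len(fields) > 1 and fields[1] == target:
                hits += 1
        return hits

# Build a tiny padded test file to exercise the scan.
rows = [b"foo\tABC\t", b"foo\tQ\t", b"bar\tABC\t"]
with tempfile.NamedTemporaryFile(delete=False) as tf:
    tf.write(b"".join(r.ljust(WIDTH, b" ") for r in rows))
    path = tf.name

result = count_matches(path)
print(result)  # 2
os.unlink(path)
```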


I also don't quite understand the difference between fapplylines and
freadblock. From http://www.jsoftware.com/jwiki/RicSherlock/FileProcessing,
it sounds like freadblock discards the partial line at the end of a block
and returns the new index, while fapplylines retains the partial line. It
sounds like freadblock is closer to what Raul is describing. I'm fuzzy
because both process the data in chunks. Thanks again.
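On the partial-line point, the carry-over behaviour being discussed can be sketched like this (a rough Python analogue of block-wise line processing, not the actual J addon API; the function name and block size are illustrative):

```python
import io

def process_blocks(f, handle_line, blocksize=4096):
    """Read f in fixed-size blocks; complete lines go to handle_line,
    and the trailing partial line is carried into the next block."""
    carry = b""
    while True:
        block = f.read(blocksize)
        if not block:
            if carry:            # flush a final unterminated line
                handle_line(carry)
            return
        block = carry + block
        pieces = block.split(b"\n")
        carry = pieces.pop()     # last piece may be an incomplete line
        for line in pieces:
            handle_line(line)

# Count lines whose second tab field is b'ABC', forcing a tiny block
# size so lines really do straddle block boundaries.
counts = []
data = io.BytesIO(b"foo\tABC\t\nfoo\tQ\t\nbar\tABC\t\n")
process_blocks(data,
               lambda ln: counts.append(ln.split(b"\t")[1] == b"ABC"),
               blocksize=7)
print(sum(counts))  # 2
```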


On Thu, Oct 10, 2013 at 8:26 PM, Ganesh Rapolu <[email protected]> wrote:

> Oops. The previous code gives the first column. This is the corrected code,
> with the same assumptions as before but with more
> than 3 columns.
>
> st =. ,: 1 0 , 0 6 ,: 0 0
> st =. st , 2 0 , 0 6 ,: 1 0
> st =. st , 0 6 , 0 6 ,: 3 1
> st =. st , 4 3 , 0 6 ,: 3 0
> st =. st , 4 0 , 0 0 ,: 4 0
> tab =. 9 { a. NB. safer for email
> secondcolumn =. (0;st;(< tab;LF))&;:  NB. Boxed list of the second column
>
> (<'ABC') +/@:= secondcolumn mf
>
>
>
> On Thu, Oct 10, 2013 at 2:14 PM, Ganesh Rapolu <[email protected]>
> wrote:
>
> > Assumptions:
> > - no double tabs
> > - no blank lines
> > - no blank columns
> > - no line starts with a tab
> > - no column itself contains a tab
> > - no CR
> > - more than 2 columns
> > - file is whole (rank 1 and not split by LF)
> >
> >
> > st =. 1 0 , 0 6 ,: 0 0
> > st =. st ,: 0 6 , 0 6 ,: 2 1
> > st =. st , 3 3 , 0 6 ,: 2 0
> > st =. st , 3 0 , 0 0 ,: 3 0
> > tab =. 9 { a.
> > secondcolumn =. (0;st;(< tab;LF))&;:  NB. Boxed list of the second column
> >
> > (<'ABC') +/@:= secondcolumn mf
> >
> >> > > ----- Original Message -----
> >> > > From: Joe Bogner <[email protected]>
> >> > > To: [email protected]
> >> > > Cc:
> >> > > Sent: Thursday, October 10, 2013 2:02:07 PM
> >> > > Subject: [Jprogramming] memory mapped tab delimited file
> >> > >
> >> > > I have a 5 gig, 9 million row tab delimited file that I'm working
> >> > > with.
> >> > >
> >> > > I started with a subset of 300k records and used fapplylines. It
> >> > > took about 5 seconds. I shaved 2 seconds off by modifying
> >> > > fapplylines to use memory mapped files.
> >> > >
> >> > > I then applied it to my larger file and found that it was taking
> >> > > about 220 seconds. Not bad, but I wanted to push for something
> >> > > faster.
> >> > >
> >> > > Using a memory mapped file was simple enough. I wrote a routine to
> >> > > add a column and pad it to the longest column (600 characters).
> >> > >
> >> > > $ mf
> >> > > 9667548 602
> >> > >
> >> > > I'd like to keep it in a tab delimited file if possible because I'm
> >> > > using that file for other purposes.
> >> > >
> >> > > The file is so large that I don't think I'll be able to cut it up
> >> > > ahead of time into an inverted table or otherwise (but maybe?), so
> >> > > I'm effectively looping through.
> >> > >
> >> > > I've played with different variants and came up with the following
> >> > > statement to count the number of rows that have column 2 = ABC
> >> > >
> >> > > +/ (3 :'(2{"1 (< ;._1 TAB -.~ y))=<''ABC''')"1  mf
> >> > >
> >> > > This gives the correct result, takes about 102 seconds, and only
> >> > > uses about 2 gig of memory while running before settling back down
> >> > > to 500mb.
> >> > >
> >> > > I picked off some of the syntax _1 TAB -.~ from other posts.
> >> > >
> >> > > Are there any ideas on how to make it go faster, or am I up against
> >> > > a hardware limit? By the way, I'm impressed with this speed as is.
> >> > > It takes about 348 seconds to read into R using the heavily
> >> > > optimized data.table fread package, which also uses memory mapped
> >> > > files. The standard import is more than a few hours. I can go from
> >> > > start to finish in J in under 102 seconds.
----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm
