Oops. The previous code extracts the first column, not the second. Here is the
corrected code, with the same assumptions as before except that it now assumes
more than 3 columns.

st =. ,: 1 0 , 0 6 ,: 0 0
st =. st , 2 0 , 0 6 ,: 1 0
st =. st , 0 6 , 0 6 ,: 3 1
st =. st , 4 3 , 0 6 ,: 3 0
st =. st , 4 0 , 0 0 ,: 4 0
tab =. 9 { a.  NB. the TAB character (safer than a literal tab in email)
secondcolumn =. (0;st;(< tab;LF))&;:  NB. Boxed list of the second column

(<'ABC') +/@:= secondcolumn mf
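For readers comparing across languages, here is a rough Python sketch of the same count over a plain tab-delimited file on disk. It does a line-by-line scan over a memory-mapped file rather than using J's `;:` sequential machine, and the file path and function name below are made up for illustration:

```python
import mmap

def count_col2(path, target):
    """Count lines whose second tab-separated field equals target (bytes)."""
    count = 0
    with open(path, "rb") as f, \
         mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        # iter(mm.readline, b"") yields one line at a time until EOF
        for line in iter(mm.readline, b""):
            fields = line.rstrip(b"\n").split(b"\t")
            if len(fields) > 1 and fields[1] == target:
                count += 1
    return count
```

This avoids loading the whole file into memory, but it still touches every byte once per query, so the J sequential-machine approach should scale similarly or better.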



On Thu, Oct 10, 2013 at 2:14 PM, Ganesh Rapolu <[email protected]> wrote:

> Assumptions:
> - no double tabs
> - no blank lines
> - no blank columns
> - no line starts with a tab
> - no column itself contains a tab
> - no CR
> - more than 2 columns
> - file is whole (rank 1 and not split by LF)
>
>
> st =. 1 0 , 0 6 ,: 0 0
> st =. st ,: 0 6 , 0 6 ,: 2 1
> st =. st , 3 3 , 0 6 ,: 2 0
> st =. st , 3 0 , 0 0 ,: 3 0
> tab =. 9 { a.
> secondcolumn =. (0;st;(< tab;LF))&;:  NB. Boxed list of the second column
>
> (<'ABC') +/@:= secondcolumn mf
>
> > ----- Original Message -----
> > From: Joe Bogner <[email protected]>
> > To: [email protected]
> > Cc:
> > Sent: Thursday, October 10, 2013 2:02:07 PM
> > Subject: [Jprogramming] memory mapped tab delimited file
> >
> > I have a 5 gig, 9 million row tab delimited file that I'm working with.
> >
> > I started with a subset of 300k records and used fapplylines. It took
> > about 5 seconds. I shaved 2 seconds off by modifying fapplylines to use
> > memory mapped files.
> >
> > I then applied it to my larger file and found that it was taking about
> > 220 seconds. Not bad, but I wanted to push for something faster.
> >
> > Using a memory mapped file was simple enough. I wrote a routine to add
> > a column and pad it to the longest column (600 characters).
> >
> > $ mf
> > 9667548 602
> >
> > I'd like to keep it in a tab delimited file if possible because I'm
> > using that file for other purposes.
> >
> > The file is so large that I don't think I'll be able to cut it up ahead
> > of time into an inverted table or otherwise (but maybe?), so I'm
> > effectively looping through.
> >
> > I've played with different variants and came up with the following
> > statement to count the number of rows that have column 2 = ABC:
> >
> > +/ (3 :'(2{"1 (< ;._1 TAB -.~ y))=<''ABC''')"1  mf
> >
> > This gives the correct result, takes about 102 seconds, uses only about
> > 2 gig of memory while running, and settles back down to 500mb.
> >
> > I picked up some of the syntax (;._1 TAB -.~) from other posts.
> >
> > Are there any ideas on how to make it go faster, or am I up against a
> > hardware limit? By the way, I'm impressed with this speed as is. It
> > takes about 348 seconds to read into R using the heavily optimized
> > data.table fread package, which also uses memory mapped files. The
> > standard import takes more than a few hours. I can go from start to
> > finish in J in under 102 seconds.
----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm