I have a 5 GB, 9-million-row tab-delimited file that I'm working with.
I started with a subset of 300k records and used fapplylines, which took
about 5 seconds. I shaved 2 seconds off by modifying fapplylines to use
memory-mapped files.
I then applied it to the larger file and found that it took about 220
seconds. Not bad, but I wanted to push for something faster.
Switching to a memory-mapped file was simple enough. I wrote a routine to
add a column and pad every row to the longest line (600 characters),
giving a character matrix:
$ mf
9667548 602
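(For anyone trying to reproduce that layout outside J: a minimal Python
sketch of the same padding step, under the assumption that padding rows to
a common width is all the J routine does; the sample data is made up.)

```python
def pad_rows(lines, width=None):
    """Pad every row of a tab-delimited file to a fixed width so the
    result can be treated as a rows-by-width character matrix, the shape
    a memory-mapped array wants. `width` defaults to the longest row."""
    rows = [ln.rstrip("\n") for ln in lines]
    if width is None:
        width = max(len(r) for r in rows)  # longest row, e.g. 600
    return [r.ljust(width) for r in rows], width

# Tiny demonstration: two rows of unequal length.
rows, width = pad_rows(["a\tABC\tx\n", "bb\tDEF\tlonger\n"])
# After padding, every row has the same length `width`.
```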
I'd like to keep it as a tab-delimited file if possible because I'm using
that file for other purposes. The file is so large that I don't think I'll
be able to cut it up ahead of time into an inverted table or otherwise
(but maybe?), so I'm effectively looping through the rows.
I've played with different variants and came up with the following
expression to count the number of rows that have column 2 = ABC:
+/ (3 :'(2{"1 (< ;._1 TAB -.~ y))=<''ABC''')"1 mf
This gives the correct result, takes about 102 seconds, and uses only
about 2 GB of memory while running, settling back down to 500 MB
afterward. I picked up some of the syntax (;._1 TAB -.~) from other posts.
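(As a sanity check on the hardware ceiling, here is a rough Python sketch
of the same count over a memory-mapped file. This is not the J approach,
just the equivalent scan; the file name and field value are placeholders.)

```python
import mmap
import os
import tempfile

def count_field2(path, value=b"ABC"):
    """Count rows whose second tab-delimited field equals `value`,
    scanning the file through a memory map line by line."""
    n = 0
    with open(path, "rb") as f:
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        for line in iter(mm.readline, b""):
            fields = line.rstrip(b"\r\n").split(b"\t")
            # rstrip the field too, in case rows were space-padded
            if len(fields) > 1 and fields[1].rstrip() == value:
                n += 1
        mm.close()
    return n

# Tiny demonstration on a throwaway file: two of three rows match.
with tempfile.NamedTemporaryFile("wb", delete=False) as tf:
    tf.write(b"x\tABC\t1\ny\tDEF\t2\nz\tABC\t3\n")
result = count_field2(tf.name)
os.unlink(tf.name)
```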
Are there any ideas on how to make it go faster, or am I up against a
hardware limit? By the way, I'm impressed with this speed as is: it takes
about 348 seconds to read the file into R using the heavily optimized
data.table fread package, which also uses memory-mapped files, and the
standard import takes more than a few hours. In J I can go from start to
finish in under 102 seconds.
----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm