Using a sequential machine (dyadic ;:) to extract the second column from the
whole file in one pass, and then doing the comparison, might be faster.

On Oct 10, 2013 9:23 AM, "Raul Miller" <[email protected]> wrote:
> When the size of your data exceeds some significant fraction of
> available memory, it's probably worth using a loop.
>
> In other words: first develop your code so it works on a smaller data
> set, then pick some suitably large block size (1GB?) and loop over
> however many blocks you need.
>
> Loops are more complicated and they do have some overhead, but in some
> situations those are trivial costs.
>
> Thanks,
>
> --
> Raul
>
>
> On Thu, Oct 10, 2013 at 2:27 PM, Pascal Jasmin <[email protected]> wrote:
> > +/ (3 :'(2{"1 (< ;._1 TAB -.~ y))=<''ABC''')"1 mf
> >
> >
> > match '-:' might be faster than =, but overall just a tacit version:
> >
> > +/ (<'ABC') = 2&{@:(<;._1 TAB -.~ ])"1 mf
> >
> > untested
> >
> >
> > ----- Original Message -----
> > From: Joe Bogner <[email protected]>
> > To: [email protected]
> > Cc:
> > Sent: Thursday, October 10, 2013 2:02:07 PM
> > Subject: [Jprogramming] memory mapped tab delimited file
> >
> > I have a 5 gig, 9 million row tab delimited file that I'm working with.
> >
> > I started with a subset of 300k records and used fapplylines. It took
> > about 5 seconds. I shaved 2 seconds off by modifying fapplylines to use
> > memory mapped files
> >
> > I then applied it to my larger file and found that it was taking about
> > 220 seconds. Not bad, but I wanted to push for something faster.
> >
> > Using a memory mapped file was simple enough. I wrote a routine to add a
> > column and pad it to the longest column (600 characters).
> >
> > $ mf
> > 9667548 602
> >
> > I'd like to keep it in a tab delimited file if possible because I'm using
> > that file for other purposes.
> >
> > The file is so large that I don't think I'll be able to cut it up ahead
> > of time into an inverted table or otherwise (but maybe?), so I'm
> > effectively looping through.
> >
> > I've played with different variants and came up with the following
> > statement to count the number of rows that have column 2 = ABC
> >
> > +/ (3 :'(2{"1 (< ;._1 TAB -.~ y))=<''ABC''')"1 mf
> >
> > This gives the correct result, takes about 102 seconds, uses only
> > about 2 gig of memory while running, and settles back down to 500 MB.
> >
> > I picked off some of the syntax _1 TAB -.~ from other posts.
> >
> > Are there any ideas on how to make it go faster, or am I up against a
> > hardware limit? By the way, I'm impressed with this speed as is. It takes
> > about 348 seconds to read into R using the heavily optimized data.table
> > fread package, which also uses memory mapped files. The standard import
> > takes more than a few hours. I can go from start to finish in J in under
> > 102 seconds.
> > ----------------------------------------------------------------------
> > For information about J forums see http://www.jsoftware.com/forums.htm
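
For reference, the block-wise loop suggested in the quoted reply could look roughly like the sketch below. It is untested against the real 5 gig file. The names countblk and countall are hypothetical, the block size is an arbitrary choice, and the row split here prepends TAB as the fret (TAB , y), which is an assumption of this sketch and not the phrase used in the original post.

```j
NB. countblk (hypothetical): count the rows of a character matrix whose
NB. tab-delimited field at index 2 is exactly 'ABC'.
countblk =: 3 : 0
 +/ (<'ABC') = (3 : '2 { <;._1 TAB , y')"1 y
)

NB. countall (hypothetical): x is the number of rows per block, y is the
NB. row matrix (for example the mapped noun mf). Only one block of rows
NB. is processed at a time.
countall =: 4 : 0
 total =. 0
 for_start. x * i. >. (#y) % x do.
   rows =. (start + i. x <. (#y) - start) { y
   total =. total + countblk rows
 end.
 total
)
```

Usage would be something like 100000 countall mf. The per-block verb can be swapped for whichever counting phrase turns out fastest on a small sample, and the block size tuned from there.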
