Re: [Jprogramming] memory mapped tab delimited file

Ganesh Rapolu Thu, 10 Oct 2013 15:23:37 -0700

Using a sequential machine to get the second column from the whole file and
then doing the comparison might be faster.
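
A rough, untested sketch of the whole-array idea. It is not a sequential
machine (;:), just a simpler stand-in that avoids boxing every row. It
assumes mf is the padded character matrix from the original post, and that
the field of interest is the one right after the first TAB in each row.
That choice of field is an assumption; the original post indexes the cut
fields with 2{. The names pat, hit and first are made up for illustration.

```j
NB. Sketch only; pat, hit and first are hypothetical names.
pat   =: TAB , 'ABC' , TAB      NB. the field ABC with a TAB on each side
hit   =: pat E."1 mf            NB. per row: 1 where pat starts
first =: mf i."1 TAB            NB. per row: index of the first TAB
NB. Count rows that contain a TAB and whose first TAB starts pat.
+/ (first < {: $ mf) *. first = hit i."1 (1)
```

Note that hit has the same shape as mf, so on the full file this would
have to be run on one block of rows at a time rather than all at once.
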
On Oct 10, 2013 9:23 AM, "Raul Miller" <[email protected]> wrote:


> When the size of your data exceeds some significant fraction of
> available memory, it's probably worth using a loop.
>
> In other words: first develop your code so it works on a smaller data
> set, then pick some suitably large block size (1GB?) and loop over
> however many blocks you need.
>
> Loops are more complicated and they do have some overhead, but in some
> situations those are trivial costs.
>
> Thanks,
>
> --
> Raul
>
>
> On Thu, Oct 10, 2013 at 2:27 PM, Pascal Jasmin <[email protected]>
> wrote:
> > +/ (3 :'(2{"1 (< ;._1 TAB -.~ y))=<''ABC''')"1  mf
> >
> >
> > match '-:' might be faster than =, but overall just a tacit version:
> >
> > +/ (<'ABC') = 2&{@:([: <;._1 TAB -.~ ])"1 mf
> >
> > untested
> >
> >
> > ----- Original Message -----
> > From: Joe Bogner <[email protected]>
> > To: [email protected]
> > Cc:
> > Sent: Thursday, October 10, 2013 2:02:07 PM
> > Subject: [Jprogramming] memory mapped tab delimited file
> >
> > I have a 5 gig, 9 million row tab delimited file that I'm working with.
> >
> > I started with a subset of 300k records and used fapplylines. It took
> > about 5 seconds. I shaved 2 seconds off by modifying fapplylines to use
> > memory mapped files.
> >
> > I then applied it to my larger file and found that it was taking about
> > 220 seconds. Not bad, but I wanted to push for something faster.
> >
> > Using a memory mapped file was simple enough. I wrote a routine to add a
> > column and pad it to the longest column (600 characters).
> >
> > $ mf
> > 9667548 602
> >
> > I'd like to keep it in a tab delimited file if possible because I'm using
> > that file for other purposes.
> >
> > The file is so large that I don't think I'll be able to cut it up ahead
> > of time into an inverted table or otherwise (but maybe?), so I'm
> > effectively looping through it.
> >
> > I've played with different variants and came up with the following
> > statement to count the number of rows that have column 2 = ABC
> >
> > +/ (3 :'(2{"1 (< ;._1 TAB -.~ y))=<''ABC''')"1  mf
> >
> > This gives the correct result and takes about 102 seconds. It uses only
> > about 2 GB of memory while running and settles back down to 500 MB.
> >
> > I picked off some of the syntax _1 TAB -.~ from other posts.
> >
> > Are there any ideas on how to make it go faster, or am I up against a
> > hardware limit? By the way, I'm impressed with this speed as is. It takes
> > about 348 seconds to read into R using the heavily optimized data.table
> > fread function, which also uses memory mapped files. The standard import
> > takes more than a few hours. I can go from start to finish in J in under
> > 102 seconds.
> > ----------------------------------------------------------------------
> > For information about J forums see http://www.jsoftware.com/forums.htm
> >
