Re: [Jprogramming] memory mapped tab delimited file

Raul Miller Thu, 10 Oct 2013 12:23:44 -0700

When the size of your data exceeds some significant fraction of
available memory, it's probably worth using a loop.


In other words: first develop your code so it works on a smaller data
set, then pick some suitably large block size (1GB?) and loop over
however many blocks you need.

Loops are more complicated and they do have some overhead, but in some
situations those are trivial costs.
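
To make the block-at-a-time idea concrete, here is a minimal sketch. It is written in Python rather than J, purely to show the shape of the loop, and it reads the raw tab-delimited file rather than the padded mapped matrix. The names count_matching_rows and _matches, the default block size of 1 MiB and the file name in the usage comment are all hypothetical. A real block size would be tuned toward something like the 1GB mentioned above.

```python
import io


def _matches(line, column, value):
    """True if the zero-based tab-delimited field `column` of `line` equals `value`."""
    fields = line.rstrip(b"\r").split(b"\t")
    return len(fields) > column and fields[column] == value


def count_matching_rows(stream, column=2, value=b"ABC", block_size=1 << 20):
    """Count rows whose field `column` equals `value`, one block at a time.

    `stream` is any binary file-like object. Only about `block_size` bytes
    (plus one partial line) are held in memory at once.
    """
    total = 0
    leftover = b""  # partial last line carried over into the next block
    while True:
        block = stream.read(block_size)
        if not block:
            break
        lines = (leftover + block).split(b"\n")
        leftover = lines.pop()  # incomplete (or empty) tail of this block
        total += sum(_matches(line, column, value) for line in lines)
    if leftover:  # final line without a trailing newline
        total += _matches(leftover, column, value)
    return total


# Usage (hypothetical file name):
#     with open("data.tsv", "rb") as f:
#         print(count_matching_rows(f, block_size=1 << 30))
```

The only extra bookkeeping the loop needs is the partial line at the end of each block, which is carried over and prepended to the next block.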

Thanks,

-- 
Raul


On Thu, Oct 10, 2013 at 2:27 PM, Pascal Jasmin <[email protected]> wrote:
> +/ (3 :'(2{"1 (< ;._1 TAB -.~ y))=<''ABC''')"1  mf
>
>
> Match (-:) might be faster than =, but overall this is just a tacit version:
>
> +/ (<'ABC') = 2&{@:(<;._1 TAB -.~ ])"1 mf
>
> untested
>
>
> ----- Original Message -----
> From: Joe Bogner <[email protected]>
> To: [email protected]
> Cc:
> Sent: Thursday, October 10, 2013 2:02:07 PM
> Subject: [Jprogramming] memory mapped tab delimited file
>
> I have a 5 gig, 9 million row tab delimited file that I'm working with.
>
> I started with a subset of 300k records and used fapplylines. It took about
> 5 seconds. I shaved 2 seconds off by modifying fapplylines to use memory
> mapped files.
>
> I then applied it to my larger file and found that it was taking about 220
> seconds. Not bad, but I wanted to push for something faster.
>
> Using a memory mapped file was simple enough. I wrote a routine to add a
> column and pad it to the longest column (600 characters).
>
> $ mf
> 9667548 602
>
> I'd like to keep it in a tab delimited file if possible because I'm using
> that file for other purposes.
>
> The file is so large that I don't think I'll be able to cut it up ahead of
> time into an inverted table or otherwise (but maybe?), so I'm effectively
> looping through it.
>
> I've played with different variants and came up with the following
> statement to count the number of rows that have column 2 = ABC
>
> +/ (3 :'(2{"1 (< ;._1 TAB -.~ y))=<''ABC''')"1  mf
>
> This gives the correct result and takes about 102 seconds. It uses only
> about 2 GB of memory while running and settles back down to 500 MB.
>
> I picked off some of the syntax _1 TAB -.~ from other posts.
>
> Are there any ideas on how to make it go faster, or am I up against a
> hardware limit? By the way, I'm impressed with this speed as is. It takes
> about 348 seconds to read into R using the heavily optimized data.table
> fread package, which also uses memory mapped files. The standard import
> takes more than a few hours. I can go from start to finish in J in under
> 102 seconds.
> ----------------------------------------------------------------------
> For information about J forums see http://www.jsoftware.com/forums.htm
