If you can make a memory mapped file approach work for you, I expect that it would be a lot faster than a buffered block approach, at least on current machines.
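A rough illustration of what the memory-mapped route looks like in J, using the jmf script. This is a hypothetical sketch, not code from this thread: the file path is made up, and the exact calling convention of map_jmf_ has varied across J versions, so treat it as a pointer rather than a recipe.

   NB. Hypothetical sketch: map a file's bytes to a name without copying.
   NB. Path and name are placeholders; verify map_jmf_'s signature for
   NB. your J version before relying on this.
   require 'jmf'
   JCHAR map_jmf_ 'dat';'/path/to/bigfile.txt'  NB. dat now refers to the file bytes
   # dat                                        NB. length, no full read needed
   unmap_jmf_ 'dat'                             NB. release the mapping when done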
Thanks,

-- Raul

On Fri, Oct 11, 2013 at 7:46 AM, Joe Bogner <[email protected]> wrote:
> Ganesh - Thank you. That is very fast. It runs in 4 seconds on my machine
> with a 9 million row sample string.
>
> st =. 1 0 , 0 6 ,: 0 0
> st =. st ,: 0 6 , 0 6 ,: 2 1
> st =. st , 3 3 , 0 6 ,: 2 0
> st =. st , 3 0 , 0 0 ,: 3 0
> tab =. 9 { a.
> secondcolumn =. (0;st;(< tab;LF))&;: NB. Boxed list of the second column
>
> lines =. ('foo',TAB,'ABC',TAB,LF,'foo',TAB,'Q',TAB,LF)
> mf =. , L:0 (9e6, # lines) $ lines
> (6!:2) 'c =: (<''ABC'') +/@:= secondcolumn mf'
> c
>
> On my real data it takes 49 seconds -- the string is considerably longer.
> I estimate 3.6 billion bytes vs 144 million. It doesn't produce the
> correct result because the data doesn't hold to the assumptions, but it's
> great to see what is possible with more work.
>
> Ric - I haven't tested what you provided yet. Conceptually, is there any
> reason to believe it'd be faster than working with a fixed-width memory
> mapped file? The line terminations are pre-determined at that point, so
> it's just a matter of looping over the indices and reformatting each line.
>
> I also don't quite understand the difference between fapplylines and
> freadblock. It sounds like freadblock discards the partial lines and
> returns the new index, while fapplylines retains the partial line.
> http://www.jsoftware.com/jwiki/RicSherlock/FileProcessing
> It sounds like freadblock is more like what Raul is describing. I'm fuzzy
> because both process the data in chunks. Thanks again.
>
> On Thu, Oct 10, 2013 at 8:26 PM, Ganesh Rapolu <[email protected]> wrote:
>> Oops. The previous code gives the first column. This is the corrected
>> code, with the same assumptions as before but with more than 3 columns.
>>
>> st =. ,: 1 0 , 0 6 ,: 0 0
>> st =. st , 2 0 , 0 6 ,: 1 0
>> st =. st , 0 6 , 0 6 ,: 3 1
>> st =. st , 4 3 , 0 6 ,: 3 0
>> st =. st , 4 0 , 0 0 ,: 4 0
>> tab =. 9 { a. NB. safer for email
>> secondcolumn =. (0;st;(< tab;LF))&;: NB. Boxed list of the second column
>>
>> (<'ABC') +/@:= secondcolumn mf
>>
>> On Thu, Oct 10, 2013 at 2:14 PM, Ganesh Rapolu <[email protected]> wrote:
>>> Assumptions:
>>> - no double tabs
>>> - no blank lines
>>> - no blank columns
>>> - no line starts with a tab
>>> - no column itself contains a tab
>>> - no CR
>>> - more than 2 columns
>>> - file is whole (rank 1 and not split by LF)
>>>
>>> st =. 1 0 , 0 6 ,: 0 0
>>> st =. st ,: 0 6 , 0 6 ,: 2 1
>>> st =. st , 3 3 , 0 6 ,: 2 0
>>> st =. st , 3 0 , 0 0 ,: 3 0
>>> tab =. 9 { a.
>>> secondcolumn =. (0;st;(< tab;LF))&;: NB. Boxed list of the second column
>>>
>>> (<'ABC') +/@:= secondcolumn mf
>>>
>>>> ----- Original Message -----
>>>> From: Joe Bogner <[email protected]>
>>>> To: [email protected]
>>>> Sent: Thursday, October 10, 2013 2:02:07 PM
>>>> Subject: [Jprogramming] memory mapped tab delimited file
>>>>
>>>> I have a 5 gig, 9 million row tab delimited file that I'm working with.
>>>>
>>>> I started with a subset of 300k records and used fapplylines. It took
>>>> about 5 seconds. I shaved 2 seconds off by modifying fapplylines to
>>>> use memory mapped files.
>>>>
>>>> I then applied it to my larger file and found that it was taking about
>>>> 220 seconds. Not bad, but I wanted to push for something faster.
>>>>
>>>> Using a memory mapped file was simple enough. I wrote a routine to add
>>>> a column and pad each row to the longest line (600 characters).
>>>>
>>>> $ mf
>>>> 9667548 602
>>>>
>>>> I'd like to keep it in a tab delimited file if possible because I'm
>>>> using that file for other purposes.
>>>>
>>>> The file is so large that I don't think I'll be able to cut it up
>>>> ahead of time into an inverted table or otherwise (but maybe?), so I'm
>>>> effectively looping through it.
>>>>
>>>> I've played with different variants and came up with the following
>>>> statement to count the number of rows that have column 2 = ABC:
>>>>
>>>> +/ (3 :'(2{"1 (< ;._1 TAB -.~ y))=<''ABC''')"1 mf
>>>>
>>>> This gives the correct result, takes about 102 seconds, and only uses
>>>> about 2 gig of memory while running, settling back down to 500mb.
>>>>
>>>> I picked up some of the syntax (;._1 TAB -.~) from other posts.
>>>>
>>>> Are there any ideas on how to make it go faster, or am I up against a
>>>> hardware limit? By the way, I'm impressed with this speed as is. It
>>>> takes about 348 seconds to read the file into R using the heavily
>>>> optimized data.table fread package, which also uses memory mapped
>>>> files. The standard import takes more than a few hours. I can go from
>>>> start to finish in J in under 102 seconds.
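For contrast, the buffered block reading that the thread keeps circling around (what freadblock/fapplylines do, and what Raul compares against memory mapping) can be sketched roughly like this. This is an untested sketch, not code from the thread: fsize and indexed fread come from the standard files script, the 1e7 block size is arbitrary, the per-block line count stands in for whatever real work is done on each block, and it assumes every block contains at least one LF (true here, since lines are at most ~602 bytes).

   NB. Rough sketch of buffered block processing: read fixed-size chunks,
   NB. carry the trailing partial line into the next chunk, and fold a
   NB. result across whole lines only.
   require 'files'
   countblocks =: 3 : 0
     sz =. fsize y
     blk =. 1e7                              NB. illustrative block size
     part =. ''
     total =. 0
     pos =. 0
     while. pos < sz do.
       chunk =. part , fread y;pos,blk <. sz-pos
       whole =. chunk {.~ 1 + chunk i: LF    NB. through the last LF
       part =. (# whole) }. chunk            NB. partial line carried forward
       total =. total + +/ LF = whole        NB. e.g. count lines per block;
                                             NB. substitute the real verb here
       pos =. pos + blk
     end.
     total
   )

The point of the carry (part) is that a line split across two reads is never processed half at a time; that is the bookkeeping memory mapping lets you skip entirely.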
----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm
