If you can make a memory mapped file approach work for you, I expect
that it would be a lot faster than a buffered block approach, at least
on current machines.
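As a rough illustration of the approach (a Python sketch with placeholder names, not a J solution): a memory mapped scan reads the file straight through the page cache, with no explicit read calls or intermediate block buffers:

```python
import mmap

# Hypothetical illustration: count the rows of a (non-empty) tab-delimited
# file whose second column equals `target`. Path and value are placeholders.
def count_matches(path, target=b"ABC", col=1):
    with open(path, "rb") as f, mmap.mmap(f.fileno(), 0,
                                          access=mmap.ACCESS_READ) as mm:
        count, start, size = 0, 0, mm.size()
        while start < size:
            end = mm.find(b"\n", start)
            if end == -1:
                end = size
            fields = mm[start:end].split(b"\t")
            if col < len(fields) and fields[col] == target:
                count += 1
            start = end + 1
        return count
```

The `mm[start:end]` slices come out of the OS page cache on demand, which is essentially the benefit being claimed over a buffered block approach.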

Thanks,

-- 
Raul

On Fri, Oct 11, 2013 at 7:46 AM, Joe Bogner <[email protected]> wrote:
> Ganesh - Thank you. That is very fast. It runs in 4 seconds on my machine
> with a 9 million row sample string.
>
> st =. 1 0 , 0 6 ,: 0 0
>
> st =. st ,: 0 6 , 0 6 ,: 2 1
>
> st =. st , 3 3 , 0 6 ,: 2 0
>
> st =. st , 3 0 , 0 0 ,: 3 0
>
> tab =. 9 { a.
>
> secondcolumn =. (0;st;(< tab;LF))&;: NB. Boxed list of the second column
>
>
> lines=.('foo',TAB,'ABC',TAB,LF,'foo',TAB,'Q',TAB,LF)
>
> mf=. , L:0 (9e6, # lines) $ lines
>
> (6!:2) 'c=:(<''ABC'') +/@:= secondcolumn mf'
>
> c
>
>
> On my real data it takes 49 seconds -- the string is considerably longer. I
> estimate 3.6 billion bytes vs 144 million. It doesn't produce the correct
> result because my data doesn't hold to the assumptions; however, it's great
> to see what is possible with more work.
>
>
>
> Ric - I haven't tested what you provided yet. Conceptually, is there any
> reason to believe that it'd be faster than working with a fixed width
> memory mapped file? The line terminations are pre-determined at that point,
> so it's just a matter of looping over the indices and reformatting each line.
>
>
> I also don't quite understand the difference between fapplylines and
> freadblock. It sounds like freadblock discards the partial line at the end
> of a block and returns the new index, while fapplylines retains the partial
> line (http://www.jsoftware.com/jwiki/RicSherlock/FileProcessing). It sounds
> like freadblock is closer to what Raul is describing. I'm fuzzy on the
> distinction because both process the data in chunks. Thanks again.
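The chunked-reading pattern in question can be sketched outside J. This Python fragment (hypothetical names, not the actual freadblock/fapplylines code) shows the bookkeeping for carrying a partial trailing line from one block into the next, so the per-line logic only ever sees complete lines:

```python
def process_by_blocks(path, handle_lines, block_size=1 << 20):
    # Hypothetical sketch: read fixed-size blocks, carrying any trailing
    # partial line over to the next block so that handle_lines only ever
    # receives complete (newline-terminated) lines.
    carry = b""
    with open(path, "rb") as f:
        while True:
            block = f.read(block_size)
            if not block:
                if carry:  # file did not end with a newline
                    handle_lines([carry])
                return
            lines = (carry + block).split(b"\n")
            carry = lines.pop()  # may be an incomplete final line
            if lines:
                handle_lines(lines)
```

Discarding `carry` and reporting the file offset of the last complete line instead would give the "return the new index" behavior described for freadblock.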
>
>
> On Thu, Oct 10, 2013 at 8:26 PM, Ganesh Rapolu <[email protected]> wrote:
>
>> Oops. The previous code gives the first column. This is the corrected code,
>> with the same assumptions as before but with more
>> than 3 columns.
>>
>> st =. ,: 1 0 , 0 6 ,: 0 0
>> st =. st , 2 0 , 0 6 ,: 1 0
>> st =. st , 0 6 , 0 6 ,: 3 1
>> st =. st , 4 3 , 0 6 ,: 3 0
>> st =. st , 4 0 , 0 0 ,: 4 0
>> tab =. 9 { a. NB. safer for email
>> secondcolumn =. (0;st;(< tab;LF))&;:  NB. Boxed list of the second column
>>
>> (<'ABC') +/@:= secondcolumn mf
>>
>>
>>
>> On Thu, Oct 10, 2013 at 2:14 PM, Ganesh Rapolu <[email protected]>
>> wrote:
>>
>> > Assumptions:
>> > - no double tabs
>> > - no blank lines
>> > - no blank columns
>> > - no line starts with a tab
>> > - no column itself contains a tab
>> > - no CR
>> > - more than 2 columns
>> > - file is whole (rank 1 and not split by LF)
>> >
>> >
>> > st =. 1 0 , 0 6 ,: 0 0
>> > st =. st ,: 0 6 , 0 6 ,: 2 1
>> > st =. st , 3 3 , 0 6 ,: 2 0
>> > st =. st , 3 0 , 0 0 ,: 3 0
>> > tab =. 9 { a.
>> > secondcolumn =. (0;st;(< tab;LF))&;:  NB. Boxed list of the second column
>> >
>> > (<'ABC') +/@:= secondcolumn mf
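The ;: state table above encodes a single pass over the text. As a rough cross-language illustration of the idea (a toy Python scanner, not the sequential machine itself): track which field of the current line you are in, and collect characters only while inside field 1:

```python
def second_column(text, sep="\t", eol="\n"):
    # Toy single-pass scanner approximating the state-table idea:
    # field_idx counts sep-delimited fields on the current line; only
    # characters of field 1 (the second column) are collected.
    fields, field_idx, cur = [], 0, []
    for ch in text:
        if ch == eol:
            if field_idx == 1 and cur:  # line without a trailing sep
                fields.append("".join(cur))
            field_idx, cur = 0, []
        elif ch == sep:
            if field_idx == 1:
                fields.append("".join(cur))
            field_idx += 1
            cur = []
        elif field_idx == 1:
            cur.append(ch)
    return fields
```

Counting rows whose second column is "ABC" is then `sum(f == "ABC" for f in second_column(data))`, the analogue of `(<'ABC') +/@:= secondcolumn mf`.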
>> >
>> >  > > ----- Original Message -----
>> >> > > From: Joe Bogner <[email protected]>
>> >> > > To: [email protected]
>> >> > > Cc:
>> >> > > Sent: Thursday, October 10, 2013 2:02:07 PM
>> >> > > Subject: [Jprogramming] memory mapped tab delimited file
>> >> > >
>> >> > > I have a 5 gig, 9 million row tab delimited file that I'm working
>> >> > > with.
>> >> > >
>> >> > > I started with a subset of 300k records and used fapplylines. It
>> >> > > took about 5 seconds. I shaved 2 seconds off by modifying
>> >> > > fapplylines to use memory mapped files.
>> >> > >
>> >> > > I then applied it to my larger file and found that it was taking
>> >> > > about 220 seconds. Not bad, but I wanted to push for something
>> >> > > faster.
>> >> > >
>> >> > > Using a memory mapped file was simple enough. I wrote a routine to
>> >> > > add a column and pad it to the longest column (600 characters).
>> >> > >
>> >> > > $ mf
>> >> > > 9667548 602
>> >> > >
>> >> > > I'd like to keep it in a tab delimited file if possible because I'm
>> >> > > using that file for other purposes.
>> >> > >
>> >> > > The file is so large that I don't think I'll be able to cut it up
>> >> > > ahead of time into an inverted table or otherwise (but maybe?), so
>> >> > > I'm effectively looping through it.
>> >> > >
>> >> > > I've played with different variants and came up with the following
>> >> > > statement to count the number of rows that have column 2 = ABC
>> >> > >
>> >> > > +/ (3 :'(2{"1 (< ;._1 TAB -.~ y))=<''ABC''')"1  mf
>> >> > >
>> >> > > This gives the correct result, takes about 102 seconds, and uses
>> >> > > only about 2 gig of memory while running, settling back down to
>> >> > > 500mb.
>> >> > >
>> >> > > I picked off some of the syntax _1 TAB -.~ from other posts.
>> >> > >
>> >> > > Are there any ideas on how to make it go faster, or am I up against
>> >> > > a hardware limit? By the way, I'm impressed with this speed as is.
>> >> > > It takes about 348 seconds to read into R using the heavily
>> >> > > optimized data.table fread package, which also uses memory mapped
>> >> > > files. The standard import is more than a few hours. I can go from
>> >> > > start to finish in J in under 102 seconds.
>> >> > >
>> >> > > ----------------------------------------------------------------------
>> >> > > For information about J forums see http://www.jsoftware.com/forums.htm
