Matthew -

if you combine Dan's idea with Raul's, you can write your result to a file
instead of building one big variable, which avoids the space problems.

Also, note the bug when a line ends in a comma: presumably you mean to
convert (',',LF) to (',0',LF)?

The one missing item is that you need to track whether the last character of
your previous chunk was a comma and account for it; in the code below, this
is done with a global.

I haven't tested this, but you could do something like the following:

fixcc=: (#!.'0'~ 1 j. ',,' E. ])  NB. fix double comma

fixcLF=: (#!.'0'~ 1 j. (',',LF) E. ]) NB. fix comma, LF (line-end character)

fix2=: fixcLF@:fixcc  NB. 2 fixes in 1!

GBLO=: ''    NB. Initialize global leftover to empty

fixChunk=: 3 : 'xx [ GBLO=: '',''#~'',''={:y [ xx=. (#GBLO)}.fix2 GBLO,y'  NB. Handle transition between chunks

This outputs the fixed chunk and sets the global to either a comma or the
empty vector, as necessary.
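Since I haven't tested the J, here is the same idea sketched in Python for
illustration; the function names and the regex-based fix are my own, and this
is an analogue of the intent, not verified J semantics:

```python
import re

# Hypothetical Python analogue of fix2: insert '0' into each empty field,
# i.e. after a comma that is followed by another comma or by LF.
# The lookahead does not consume the next delimiter, so a run like ',,,'
# becomes ',0,0,' as it should.
def fix2(text):
    return re.sub(r',(?=[,\n])', ',0', text)

leftover = ''  # analogue of GBLO: ',' when the previous chunk ended in a comma

def fix_chunk(chunk):
    """Fix one chunk, carrying a trailing comma across chunk boundaries."""
    global leftover
    # Prepend the carried comma so a ',,' or comma-LF pair straddling the
    # boundary is visible to the fix ...
    fixed = fix2(leftover + chunk)
    out = fixed[len(leftover):]  # ... then drop it: it was already output
    leftover = ',' if chunk.endswith(',') else ''
    return out
```

Feeding the chunks '0,' and ',0' followed by LF yields '0,' then '0,0' and
LF, i.e. '0,0,0' plus LF overall, with the missing '0' inserted even though
the empty field straddled the chunk boundary.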

Not a complete answer but it may help you.
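For comparison, Raul's block-reading loop (quoted below) might be sketched in
Python like this; the function names, block size, and regex-based fix are my
own assumptions, standing in for the 1!:11 / 1!:3 reads and appends:

```python
import re

def fix_text(text):
    # '0' for each empty CSV field: a comma followed by a comma or LF
    return re.sub(r',(?=[,\n])', ',0', text)

def stream_fix(src_path, dst_path, block_size=1 << 20):
    """Read src in blocks, fix only complete lines, append to dst.
    Characters after the last LF in a block are carried into the next
    one, so no empty field can straddle a block boundary."""
    frag = ''
    with open(src_path) as src, open(dst_path, 'w') as dst:
        while True:
            block = src.read(block_size)
            if not block:
                break
            block = frag + block
            cut = block.rfind('\n') + 1  # end of last complete line (0 if none)
            frag = block[cut:]           # carry the trailing fragment
            dst.write(fix_text(block[:cut]))
        if frag:                         # flush a final line with no trailing LF
            if frag.endswith(','):
                frag += '0'              # empty last field at end of file
            dst.write(fix_text(frag))
```

Because only complete lines are ever fixed, the chunk-boundary comma problem
disappears here; a small block_size exercises the fragment carry on every
read, and the output is identical for any block size.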

Good Luck,

Devon

On 7/29/08, Raul Miller <[EMAIL PROTECTED]> wrote:
>
> On 7/29/08, Matthew Brand <[EMAIL PROTECTED]> wrote:
> > Does anyone know a faster algorithm to do this on such a large file?
> > Can it be done in a 32-bit address space?  The problem can be solved
> > by streaming through the data in C++, but I want to know how to do it
> > in J efficiently without using explicit loops.
>
>
> Your file is over a gigabyte -- just writing that much data will take a
> lot of time (how much time depends on your disk -- its speed, how
> much space it has, and how fragmented that space is).
>
> That said, this could be made to work in a 32 bit address space.  The
> trick is that you do not have to process your entire file at once.
>
> require'csv'
> fixcsv ('0,0,0',LF),'0,0,0',LF
>
> The fixcsv routine will take a csv text element and convert it
> to the corresponding table structure.
>
> Hypothetically speaking, you could read blocks of data in
> (using 1!:11), process them, then append them to a result
> file (using 1!:3).  You would also want to keep track of any
> line fragment (the characters following the last LF in your
> block) and prepend it to the next block that you read
> in, but that's fairly simple.
>
> With an appropriate block size (maybe a megabyte? 10MB?)
> your overhead from J should not be too bad, but your intermediate
> results should not be too large.
>
> This will not be particularly quick -- not with that much data -- but
> you can take some comfort in being able to watch the result file
> growing as it gets processed.
>
> FYI,
>
> --
>
> Raul
>
> ----------------------------------------------------------------------
> For information about J forums see http://www.jsoftware.com/forums.htm
>



-- 
Devon McCormick, CFA
^me^ at acm.org is my preferred e-mail
