Matthew Brand wrote:
> I want to read in the file and output a new one with
> zeroes inbetween the two commas.

Ideally, you'd want to be able to do this:

           (#!.'0'~ 1 j. ',,' E. ]) text
as in:

           text   =.  '0,,34567,,abcd,,efg'
           
           output =.  '0,0,34567,0,abcd,0,efg'
        
           text #!.'0'~ 1 j. ',,' E. text
        0,0,34567,0,abcd,0,efg
        
           (#!.'0'~ 1 j. ',,' E. ]) text
        0,0,34567,0,abcd,0,efg
           
           output -: (#!.'0'~ 1 j. ',,' E. ]) text
        1

Unfortunately, as you noted:

>  [the file] is so big that doing anything with it
>  apart from very basic things in 32-bit seems to 
>  run out of memory:

So if I try to use the same verb on a larger dataset:

           big_text =.  134125010 $ text

           (#!.'0'~ 1 j. ',,' E. ]) big_text
        |limit error
        |       (#!.'0'~1 j.',,'E.])big_text

The problem is the intermediate temporary arrays (in this case, the array 
produced by  j.  of 134125010 complex numbers, or 2*134125010 double-precision 
floating point numbers, or just over 2 gigabytes, which is the maximum virtual 
memory available to a process on a 32-bit machine).
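
A quick back-of-the-envelope check of that one temporary (16 bytes per 
complex number, i.e. two 8-byte doubles):

           16 * 134125010
        2146000160

which is right at the 2-gigabyte ceiling before counting the input, the 
output, or any of the other temporaries.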

The usual solution to this problem is breaking the input into manageable 
chunks.  I share your distaste for this workaround.  But I'm pretty sure it's 
necessary in this case (short of using a 64-bit machine).

Now, you requested:

>  but I want know how to do it in J efficiently
>  without using explicit loops.

So I'm going to have to cheat a little.  Instead of using explicit loops, I'm 
going to use an implicit loop.  Basically, the idea is to create a tacit 
equivalent of:

           big_output =. ''
        
           for_chunk.  chunkify  big_text do.
                big_output =. big_output , (#!.'0'~ 1 j. ',,' E. ]) chunk
           end.

where the monad  chunkify  breaks its argument into appropriate 
manageably-sized chunks.  Here's one such approach:

           $  _10000 ;@:(<@:(#!.'0'~ 1 j.',,' E. ])\) big_text
        155300525

Here, the loop is translated from explicit to tacit with  (-chunk_size) 
;@:(<@:loop_body\) input  .  
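
As a reminder, a negative left argument to  \  yields non-overlapping 
infixes, so  <@:loop_body\  applies the loop body to each chunk exactly 
once.  A small illustration:

           _3 <\ 'abcdefgh'
        +---+---+--+
        |abc|def|gh|
        +---+---+--+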

You'll have to fiddle a bit with  chunk_size  .  The ideal  chunk_size  is the 
largest that will work on your system (i.e. the maximum that doesn't assert 
out-of-memory).  This is for two reasons.  First, the rule of thumb in J is to 
give primitives as much data as possible (which means if you must partition 
your data, it is best to partition it into fewer, larger chunks).  Second, the 
translation has a bug, but the bug can occur at most once per chunk boundary, 
so the larger your chunks, the fewer boundaries there will be, and 
consequently fewer opportunities for the bug to arise.

The bug is due to the use of fixed-size chunks.  Fixed-size chunks can 
partition the input between any two arbitrary characters.  Given a large 
enough input, one such partition is bound to fall between two successive 
commas.  When that happens, the two commas end up in different chunks, so the 
',,' E. ]  pattern matcher cannot recognize them, and that particular 
instance of  ',,'  won't be replaced by  ',0,'  in the output.
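
You can see the bug in miniature with a deliberately tiny chunk size of 2, 
which splits the leading  ',,'  across two chunks:

           (#!.'0'~ 1 j. ',,' E. ]) '0,,34567'
        0,0,34567
        
           _2 ;@:(<@:(#!.'0'~ 1 j. ',,' E. ])\) '0,,34567'
        0,,34567

The chunked version silently misses the replacement, with no error to warn 
you.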

So the translation from the explicit loop, with its hypothetical  chunkify  , 
is incomplete.  If you're really interested in a tacit solution, then the next 
avenue to explore is replacing   \  with  ;.  .  That is, the final form of the 
tacit solution will probably be  chunk_mask ;@:(<@:loop_body);.2 input  .  

The hard part will be calculating  chunk_mask  (a vector of booleans indicating 
where to partition the text) such that it minimizes the number of cuts, never 
cuts between two successive commas, and doesn't run the whole expression out of 
memory (due to temporary intermediate arrays in the calculation of  chunk_mask  
or  loop_body  ).
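
One possible starting point (a sketch, not a full solution;  csize  is a 
hypothetical chunk size):

           csize =. 10000                     NB. hypothetical chunk size
           cuts  =. 0 = csize | >: i. # text  NB. candidate cut after every csize characters
           safe  =. cuts > ',,' E. text       NB. drop any cut that would split a ',,'
           chunk_mask =. 1 (_1}) safe         NB. the final interval must also end with a cut

with the bulk of the work then done by  
chunk_mask ;@:(<@:(#!.'0'~ 1 j. ',,' E. ]));.2 text  .  Note that dropping an 
unsafe cut merges two chunks, so a chunk can grow to nearly twice  csize  , 
and that  i. # text  is itself a large temporary, so even this sketch needs 
care on a 32-bit system.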

Left as an exercise to the reader.

-Dan

----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm
