Matthew Brand wrote:
> I want to read in the file and output a new one with
> zeroes inbetween the two commas.
Ideally, you'd want to be able to do this:
(#!.'0'~ 1 j. ',,' E. ]) text
as in:
text =. '0,,34567,,abcd,,efg'
output =. '0,0,34567,0,abcd,0,efg'
text #!.'0'~ 1 j. ',,' E. text
0,0,34567,0,abcd,0,efg
(#!.'0'~ 1 j. ',,' E. ]) text
0,0,34567,0,abcd,0,efg
output -: (#!.'0'~ 1 j. ',,' E. ]) text
1
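(A quick aside on the machinery, in case the complex left argument to # is
unfamiliar: ',,' E. text marks the first comma of each pair, and a count of
1j1 in x #!.'0' y means "copy this character once, then append one fill
atom", here '0' :)

```
   ',,' E. text                NB. 1 at the first comma of each ',,' pair
0 1 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0
   1 j. ',,' E. text           NB. 1j1 = copy the character, then one fill
1 1j1 1 1 1 1 1 1 1j1 1 1 1 1 1 1j1 1 1 1 1
```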
Unfortunately, as you noted:
> [the file] is so big that doing anything with it
> apart from very basic things in 32-bit seems to
> run out of memory:
So if I try to use the same verb on a larger dataset:
big_text =. 134125010 $ text
(#!.'0'~ 1 j. ',,' E. ]) big_text
|limit error
| (#!.'0'~1 j.',,'E.])big_text
The problem is the intermediate temporary arrays (in this case, the array
produced by j. of 134125010 complex numbers, i.e. 2*134125010 double-precision
floating point numbers, or just over 2 gigabytes, which is roughly the maximum
VM available to a process on a typical 32-bit machine).
The usual solution to this problem is to break the input into manageable
chunks. I share your distaste for this workaround, but I'm pretty sure it's
necessary in this case (short of using a 64-bit machine).
Now, you requested:
> but I want know how to do it in J efficiently
> without using explicit loops.
So I'm going to have to cheat a little. Instead of using explicit loops, I'm
going to use an implicit loop. Basically, the idea is to create a tacit
equivalent of:
big_output =. ''
for_chunk. chunkify big_text do.  NB. assuming chunkify yields boxed chunks
  big_output =. big_output , (#!.'0'~ 1 j. ',,' E. ]) > chunk
end.
where the monad chunkify breaks its argument into appropriate
manageably-sized chunks. Here's one such approach:
$ _10000 ;@:(<@:(#!.'0'~ 1 j.',,' E. ])\) big_text
155300525
Here, the loop is translated from explicit to tacit with (-chunk_size)
;@:(<@:loop_body\) input . (A negative left argument to \ applies the verb
to successive non-overlapping chunks of that size, rather than to
overlapping infixes.)
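To see the negative-infix behavior on a small example:

```
   _3 <\ 'abcdefgh'          NB. negative size: non-overlapping chunks
+---+---+--+
|abc|def|gh|
+---+---+--+
```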
You'll have to fiddle a bit with chunk_size . The ideal chunk_size is the
largest that will work on your system (i.e. the largest that doesn't run out
of memory). There are two reasons for this. First, the rule of thumb in J is
to give primitives as much data as possible (which means if you must
partition your data, it is best to partition it into fewer, larger chunks).
Second, the translation has a bug, but the bug can occur at most once per
chunk boundary, so the larger your chunks, the fewer boundaries there will
be, and consequently the fewer opportunities for the bug to arise.
The bug is due to the use of fixed-size chunks. A fixed-size chunk can end
between any two arbitrary characters, and given a large enough input, one
such boundary is bound to fall between two successive commas. Those commas
are then split across chunks, so the ',,' E. ] pattern matcher cannot
recognize them, and that particular instance of ',,' won't be replaced by
',0,' in the output.
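A small session makes the failure mode concrete: when the boundary falls
between the two commas, neither half contains the full pattern, so E. finds
nothing in either chunk:

```
   ',,' E. 'abc,,def'        NB. the pair is found in the whole string
0 0 0 1 0 0 0 0
   ',,' E. 'abc,'            NB. first half of a split: no match
0 0 0 0
   ',,' E. ',def'            NB. second half: no match either
0 0 0 0
```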
So the translation from the explicit loop, with its hypothetical chunkify ,
is incomplete. If you're really interested in a tacit solution, then the next
avenue to explore is replacing \ with ;. . That is, the final form of the
tacit solution will probably be chunk_mask ;@:(<@:loop_body);.2 input .
The hard part will be calculating chunk_mask (a boolean vector indicating
where to partition the text) such that it minimizes the number of cuts, never
cuts between two successive commas, and doesn't itself run out of memory (due
to temporary intermediate arrays in the calculation of chunk_mask or
loop_body ).
Left as an exercise for the reader.
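(For what it's worth, here is one rough sketch to start from, untested at
scale; it uses ;.1 rather than ;.2 , k is a chunk size you'd tune as above,
and a suppressed cut simply merges two adjacent chunks, so the occasional
chunk may be up to twice as large:)

```
   k =. 10000
   cuts =. 0 = k | i. # text          NB. tentative cut every k characters
   bad =. 0 , 2 *./\ ',' = text       NB. 1 where a cut would split a ',,'
   mask =. cuts > bad                 NB. keep only the safe cuts
   output -: ; mask (<@:(#!.'0'~ 1 j. ',,' E. ]));.1 text
1
```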
-Dan
----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm