See the 'prankb' and 'bslash' modifiers from my mailing list posting a few
months ago (or at
https://github.com/moon-chilled/j/blob/master/parallel/p.ijs). (I really
ought to polish these up into a library at some point.) The gist: if you
have N cores, slice 'rows' into N chunks and process one chunk on each
core. Since you want to write a file, you'd be best served by _not_ opening
the whole boxed result array at once, but just writing out each element in
turn.
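To make the gist concrete, here is a minimal sketch of the chunk-per-task idea. It is not the code from p.ijs: the names rowchunks, chunktask and pwrite are hypothetical, and chunktask is only a stand-in for the real per-chunk tokenizing. It assumes J 9.04 with the t. conjunction, and that worker threads have already been created (for example with 0 T. 0); as far as I know, without worker threads the tasks just run in the calling thread.

```j
NB. Sketch only; rowchunks, chunktask and pwrite are hypothetical names.

NB. rowchunks: x is the number of chunks, y is the number of rows.
NB. Result: boxed runs of consecutive row numbers (at most x boxes).
rowchunks=: 4 : '(<. (x * i. y) % y) </. i. y'

NB. chunktask: stand-in for the real per-chunk work. y is a list of
NB. row numbers; the result is the literal text to write for them.
chunktask=: 3 : ', LF ,.~ ": ,. y'

NB. pwrite: x is an open file number or a boxed file name,
NB. y is (number of chunks , number of rows).
pwrite=: 4 : 0
 'n rows'=. y
 NB. start one task per chunk; each result is a pyx (a box)
 pyxes=. (chunktask t. '')@> n rowchunks rows
 for_p. pyxes do.
  (> p) 1!:3 x   NB. opening one pyx waits for that task only
 end.
 # pyxes
)
```

Something like `ouf pwrite 4 1183748` would then append the four results in row order. Writing inside the loop, rather than razing all the pyxes first, keeps only one chunk's output opened at a time.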
On Fri, 7 Oct 2022, David Lambert wrote:
I've got a sparse csv with shape 1183748 2141. Where a field has data, its
tally is usually 2, and probably never more than 7.
My usual methods run out of memory, something like:
([: <;._2 ,&',');._2 CR-.~LF_separated_csv
I tried a sparse array to store the index numbers of the data, with the
data held in a preallocated vector of a: to avoid copying.
I killed that run after 8 hours; my usual hindsight says I should have
displayed status every 5 minutes.
In this attempt I send the output directly to file, thereby achieving a
durable result.
Since the rows are independent, multiple threads can work on different
sections of the mapped csv.
What, please, would be a good way to apply the new threading primitives
to the tokenize verb?
Engine: j904/j64avx/windows
Beta-e: commercial/2022-07-16T19:25:02
Library: 9.04.03
Platform: Win 64
Installer: J904 install
InstallPath: c:/users/user/downloads/j904_win64/j904
require 'jmf'
open=: 1!:21@boxopen
close=: 1!:22
write=: 1!:2~
append=: 1!:3~
testfile=:'c:/Users/user/temp/tc.csv'
datafile=:'c:/Users/user/ZW/kaggle.com/bosch-production-line-performance/train_categorical.csv'
NB. indexes helps extract the row of data between linefeeds x and x+1
indexes=: (>:@{. + [: i.@<: -~/)@({ ~ 0 1&+)~
assert 6 7 8 -: 1 indexes 2 5 9 33
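tokenize=: 4 :0 NB. x is file number for write, y is the literal
 rows=. _1 , I. LF = y NB. linefeed positions, plus a virtual one before row 0
 row_tally=. <: # rows
 for_row. i. row_tally do.
  NB. box the comma-separated fields of this row; echo shows the field count
  echo #fields=. ([: <;._2 ,&',') y {~ row indexes rows
  cols=. }. I. a: ~: fields NB. indexes of data in row excluding ID
  NB. one output line per datum: row number, column number, field text
  x append , LF ,.~ (' ' ,.~ ": row ,. cols) ,"1 > cols { fields
 end.
)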
JCHAR map_jmf_'INF';testfile ] datafile
ouf=: open testfile , '.out'
ouf tokenize INF
close ouf
unmap_jmf_'INF'
----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm