See the 'prankb' and 'bslash' modifiers from my mailing list posting a few months ago (also at https://github.com/moon-chilled/j/blob/master/parallel/p.ijs). (I really ought to polish these up into a library at some point.) The gist: if you have N cores, slice 'rows' into N chunks and process one chunk on each core. Since you want to write a file, you'd be best served by _not_ opening the boxed result array, and instead writing out each element in turn.
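
To make the gist concrete, here is a minimal, untested sketch of the chunk-per-thread pattern. It is not the code from p.ijs: 'nthreads', 'chunks', 'work' and 'run' are names invented for illustration, and it assumes the J 9.04 threading primitives (T. to start worker threads, t. to run a verb as a task that returns a pyx). 'work' is a placeholder for whatever turns one chunk of the file into output text.

```j
NB. Sketch only; assumes J 9.04 with T. and t. available.
nthreads=: 4                    NB. assumed core count; adjust for your machine
0 T."0 ] nthreads # 0           NB. start worker threads in threadpool 0
                                NB. (release form; early betas may differ)

NB. x chunks y: cut LF-terminated literal y into at most x boxes,
NB. each ending on a row boundary. Note that this copies the data.
chunks=: 4 : 0
 ends=. I. LF = y                        NB. position of every LF
 grp=. <. x * (i. # ends) % # ends       NB. chunk number for each row
 stops=. grp {:/. ends                   NB. last LF position in each chunk
 (1 stops} (# y) # 0) <;.2 y             NB. cut after each stop, keeping the LF
)

work=: ]                        NB. placeholder: literal in, literal out

NB. x run y: x is a file number or boxed file name, y is the literal
run=: 4 : 0
 pyxes=. work t. '' @> nthreads chunks y  NB. one task per chunk
 for_p. pyxes do.
  (> p) 1!:3 x                  NB. opening a pyx blocks until that task is done
 end.
 EMPTY
)
```

With the names from your script, usage would be something like 'ouf run INF'. The results are written in chunk order, so row order is preserved and only the writing is serial. A real 'work' would also need each chunk's starting row number if the output carries row indexes, as yours does.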

On Fri, 7 Oct 2022, David Lambert wrote:

I've got a sparse csv with shape 1183748 2141.  Where there is data, it usually has tally 2, and is probably never longer than 7.
My usual methods run out of memory, something like
([: <;._2 ,&',');._2  CR-.~LF_separated_csv

I tried a sparse array to store the index numbers of the data, with the data as a preallocated vector of a: to avoid copying. It was killed at 8 hours of runtime; my usual hindsight says I should have displayed status every 5 minutes.

In this attempt I send the output directly to file, thereby achieving a durable result.

Since the rows are independent, multiple threads can work different sections of the mapped csv.

What, please, would be a good way to apply the new threading primitives to the tokenize verb?


Engine: j904/j64avx/windows
Beta-e: commercial/2022-07-16T19:25:02
Library: 9.04.03
Platform: Win 64
Installer: J904 install
InstallPath: c:/users/user/downloads/j904_win64/j904


require 'jmf'
open=: 1!:21@boxopen
close=: 1!:22
write=: 1!:2~
append=: 1!:3~

testfile=:'c:/Users/user/temp/tc.csv'
datafile=:'c:/Users/user/ZW/kaggle.com/bosch-production-line-performance/train_categorical.csv'

NB. indexes helps extract the row of data between linefeeds x and x+1
indexes=: (>:@{. + [: i.@<: -~/)@({ ~ 0 1&+)~
assert 6 7 8 -: 1 indexes 2   5 9   33

tokenize=: 4 :0  NB. x is file number for write, y is the literal
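 rows=. _1 , I. LF = y  NB. LF positions, with _1 standing in for an LF before row 0
 row_tally=. <: # rows
 for_row. i. row_tally do.
  NB. box the comma-separated fields of this row; echo the field count as progress
  echo #fields=. ([: <;._2 ,&',') y {~ row indexes rows
  cols=. }. I. a: ~: fields  NB. indexes of data in row excluding ID
  NB. append one "row col value" line per non-empty field
  x append , LF ,.~ (' ' ,.~ ": row ,. cols) ,"1 > cols { fields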
 end.
)

JCHAR map_jmf_'INF';testfile ] datafile  NB. maps datafile; change ] to [ to map testfile instead
ouf=: open testfile , '.out'

ouf tokenize INF

close ouf
unmap_jmf_'INF'

----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm
