Mariusz, I've used the following adverb (see below) to process 4 GB CSVs. Basically it works through the file in byte chunks. As the J forum email tends to wreck embedded code, you can see how this adverb is used in the database ETL system that relies on it here:
https://bakerjd99.files.wordpress.com/2021/08/swiftprep.pdf

You might also find this amusing:
https://analyzethedatanotthedrivel.org/2021/08/11/jetl-j-extract-transform-and-load/

ireadapply=:1 : 0

NB.*ireadapply v-- apply verb (u) to n byte line blocks of static file.
NB.
NB. adv: u ireadapply (clFileIn ; clFileOut ; clDel ; iaBlockSize ;< uuData)
NB.
NB.   fi=. winpathsep ;1 dir SwiftZipCsvDir,'ItemSales-*.csv'
NB.   fo=. SwiftTsvDir,'land_ItemSales.txt'
NB.   smoutput@:(>@{. ; ($&.>)@:}.) ireadapply fi;fo;CRLF;20000000;<''

NB. file in, file out, line delimiter, block size, (u) verb data
'fi fo d k ud'=. y

p=. 0         NB. file pointer
c=. 0         NB. block count
s=. fsize fi  NB. file bytes
k=. k<.s      NB. first block size
NB.debug. b=. i.0  NB. block sizes (chk)

while. p < s do.
  'iread error' assert -. _1 -: r=. (1!:11 :: _1:) fi;p,k
  c=. >:c  NB. block count
  NB. complete lines
  if. 0 = #l=. d beforelaststr r do.
    NB. final shard
    NB.debug. b=. b,#r
    u c;1;d;fo;r;<ud
    break.
  end.
  p=. p + #l      NB. inc file pointer
  k=. k <. s - p  NB. next block size
  NB.debug. b=. b,#l  NB. block sizes list
  NB. block number, shard, delimiter, file out, line bytes, (u) data
  u c;0;d;fo;l;<ud
end.

NB.debug. 'byte mismatch' assert s = +/b
c  NB. blocks processed
)

On Mon, Aug 16, 2021 at 7:17 PM Raul Miller <[email protected]> wrote:

> 1. As you have noticed, certainly. There are details, of course (what
> block size to use? Are files guaranteed to be well formed? If not,
> what are error conditions? (are certain characters illegal? Are lines
> longer than the block size allowed?) Do you want a callback interface
> for each block? If so, do you need an "end of file" indication? If so,
> is that a separate callback or a distinct argument to the block
> callback? etc.)
>
> 2. Again, as you have noticed: yes. And, there are analogous details
> here...
>
> 3. The expat API should only require J knowledge.
> There are a couple of examples in the addons/api/expat/test/
> directory, named test0.ijs and test1.ijs.
>
> I hope this helps,
>
> --
> Raul
>
> On Mon, Aug 16, 2021 at 4:23 PM Mariusz Grasko <[email protected]> wrote:
> >
> > Thank you for some ideas on using an external parser.
> > Okay, now I have 3 questions:
> > 1. Is it possible to read a CSV file streaming-style (for example,
> > record by record) without loading everything into memory? Even if I
> > use some external parsing solution like XSLT, or just write something
> > myself in some language other than J, I will end up with a large CSV
> > instead of a large XML. It makes no difference. The reason I need to
> > parse it like this is that there are some rows that I won't need;
> > those would be discarded depending on their field values.
> > If it is not possible, I would do more work outside of J in this
> > first parser, XML -> CSV.
> > 2. Is there a way to call an external program from a J script? If
> > so, is it possible to wait for it to finish?
> > If it is not possible, there are definitely ways to run J from other
> > programs.
> > 3. Can someone give me a few pointers on how to use the api/expat
> > library? Do I need to familiarize myself with expat (the C library),
> > or should a good understanding of J and reading the small tests in
> > the package directory be enough?
> > I could send an example file, as Devon McCormick suggested.
> >
> > Right now I am working through the book "J: The Natural Language for
> > Analytic Computing" and playing around with problems like Project
> > Euler, but I could really see myself using J in serious work.
> >
> > Best regards,
> > MG
> >
> >
> > On Wed, 11 Aug 2021 at 09:51, <[email protected]> wrote:
> >
> > > In similar situations (though my files are not huge), I extract
> > > what I want into flattened CSV using one or more XQuery scripts,
> > > and then load the CSV files with J. The code is clean, compact and
> > > easy to maintain.
> > > For recurrent XQuery patterns, m4 occasionally comes to the
> > > rescue. Expect minor portability issues when using different
> > > XQuery processors (extensions, language level...).
> > >
> > > Never got round to SAX parsing beyond tutorials, so I cannot
> > > compare.
> > >
> > >
> > > From: Mariusz Grasko <[email protected]>
> > > To: [email protected]
> > > Subject: [Jprogramming] Is is good idea to use J for reading large XML files ?
> > > Date: 10/08/2021 18:05:45 Europe/Paris
> > >
> > > Hi,
> > >
> > > We are an ecommerce company and have a lot of integrations with
> > > suppliers; product info is nearly always in XML files. I am
> > > thinking about using J as an analysis tool. Do you think that
> > > working with large files that need to be parsed SAX-style, without
> > > reading everything at once, is a good idea in J?
> > > Also, is this even advantageous (as in, would the code be terse)?
> > > Right now XML parsing is done in Golang, so if parsing in J is not
> > > very good we could try to rely more on CSV exports. CSV is
> > > definitely very good in J.
> > > I am hoping that maybe XML parsing is very good in J and the code
> > > would become much smaller. If this is the case, then I would think
> > > about using J for XMLs with new suppliers.
> > >
> > > Best Regards
> > > M.G.
> > > ----------------------------------------------------------------------
> > > For information about J forums see http://www.jsoftware.com/forums.htm

--
John D. Baker
[email protected]
