Mariusz,

I've used the following adverb (see below) to process 4 GB CSVs. Basically,
it works through the file in byte chunks. As the J forum email tends to wreck
embedded code, you can see how this adverb is used in the database ETL system
that uses it here:

https://bakerjd99.files.wordpress.com/2021/08/swiftprep.pdf

You might also find this amusing:

https://analyzethedatanotthedrivel.org/2021/08/11/jetl-j-extract-transform-and-load/

ireadapply=: 1 : 0

NB.*ireadapply v-- apply verb (u) to n byte line blocks of static file.
NB.
NB. adv: u ireadapply (clFileIn ; clFileOut ; clDel ; iaBlockSize ;< uuData)
NB.
NB.   fi=. winpathsep ;1 dir SwiftZipCsvDir,'ItemSales-*.csv'
NB.   fo=. SwiftTsvDir,'land_ItemSales.txt'
NB.   smoutput@:(>@{. ; ($&.>)@:}.) ireadapply fi;fo;CRLF;20000000;<''

NB. file in, file out, line delimiter, block size, (u) verb data
'fi fo d k ud'=. y

p=. 0         NB. file pointer
c=. 0         NB. block count
s=. fsize fi  NB. file bytes
k=. k<.s      NB. first block size
NB.debug. b=. i.0 NB. block sizes (chk)

while. p < s do.
  'iread error' assert -. _1 -: r=. (1!:11 :: _1:) fi;p,k
  c=. >:c     NB. block count
  NB. complete lines
  if. 0 = #l=. d beforelaststr r do.
    NB. final shard
    NB.debug. b=. b,#r
    u c;1;d;fo;r;<ud break.
  end.
  p=. p + #l       NB. inc file pointer
  k=. k <. s - p   NB. next block size
  NB.debug. b=. b,#l NB. block sizes list
  NB. block number, shard, delimiter, file out, line bytes, (u) data
  u c;0;d;fo;l;<ud
end.

NB.debug. 'byte mismatch' assert s = +/b
c NB. blocks processed
)
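
For illustration only, here is a minimal sketch of a verb that could be passed as (u). It is not part of the ETL system linked above: the name copyblock and the file names are hypothetical, and the argument layout is taken from the comment inside the adverb (block number, shard flag, delimiter, file out, block bytes, (u) data). It simply appends each block it receives to the output file and reports the block size.

```j
NB. copyblock: hypothetical (u) verb for ireadapply, not from the original post.
NB. Appends each block as received to the output file and reports its size.
copyblock=: 3 : 0
NB. block number, final shard flag, delimiter, file out, block bytes, (u) data
'c shard d fo bytes ud'=. y
bytes 1!:3 <fo   NB. 1!:3 appends bytes to the named file
smoutput 'block ',(":c),': ',(":#bytes),' bytes',shard#' (final shard)'
)

fi=: 'in.csv'    NB. hypothetical input file
fo=: 'out.csv'   NB. hypothetical output file
ferase fo        NB. start from an empty output, since 1!:3 appends
copyblock ireadapply fi;fo;CRLF;1000000;<''
```

A verb that discards unwanted rows would follow the same shape: cut the block bytes into lines, keep the ones that pass the test, and append only those to the output file.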

On Mon, Aug 16, 2021 at 7:17 PM Raul Miller <[email protected]> wrote:

> 1. As you have noticed, certainly. There are details, of course (what
> block size to use? Are files guaranteed to be well formed? If not,
> what are error conditions? (are certain characters illegal? Are lines
> longer than the block size allowed?) Do you want a callback interface
> for each block? If so, do you need an "end of file" indication? If so,
> is that a separate callback or a distinct argument to the block
> callback? etc.)
>
> 2. Again, as you have noticed: yes. And, there are analogous details
> here...
>
> 3. The expat API should only require J knowledge. There are a couple
> examples in the addons/api/expat/test/ directory named test0.ijs and
> test1.ijs
>
> I hope this helps,
>
> --
> Raul
>
> On Mon, Aug 16, 2021 at 4:23 PM Mariusz Grasko
> <[email protected]> wrote:
> >
> > Thank you for some ideas on using an external parser.
> > Okay, now I have 3 questions:
> > 1. Is it possible to read a CSV file streaming-style (for example record by
> > record) without loading everything in memory? Even if I use some external
> > parsing solution like XSLT or just write something myself in some other
> > language than J, I will end up with a large CSV instead of a large XML. It
> > makes no difference. The reason that I need to parse it like this is that
> > there are some rows that I won't need; those would be discarded depending
> > on their field values.
> > If it is not possible, I would do more work outside of J in this first
> > parser XML -> CSV.
> > 2. Is there a way to call an external program from a J script? Is it
> > possible to wait for it to finish?
> > If it is not possible, there are definitely ways to run J from other
> > programs.
> > 3. Can someone give a few pointers on how to use the api/expat
> > library? Do I need to familiarize myself with expat (C library), or should
> > a good understanding of J and reading the small tests in the package
> > directory be enough?
> > I could send some example file like Devon McCormick suggested.
> >
> > Right now I am working through the book "J: The Natural Language for
> > Analytic Computing" and playing around with problems like Project Euler,
> > but I could really see myself using J in serious work.
> >
> > Best regards,
> > MG
> >
> >
> > On Wed, Aug 11, 2021 at 09:51 <[email protected]> wrote:
> >
> > > In similar situations -but my files are not huge- I extract what I want
> > > into flattened CSV using one or more XQuery scripts, and then load the
> > > CSV files with J.  The code is clean, compact and easy to maintain. For
> > > recurrent XQuery patterns, m4 occasionally comes to the rescue. Expect
> > > minor portability issues when using different XQuery processors
> > > (extensions, language level...).
> > >
> > > Never got round to SAX parsing beyond tutorials, so I cannot compare.
> > >
> > > From: Mariusz Grasko <[email protected]>
> > > To: [email protected]
> > > Subject: [Jprogramming] Is is good idea to use J for reading large XML files ?
> > > Date: 10/08/2021 18:05:45 Europe/Paris
> > >
> > > Hi,
> > >
> > > We are an ecommerce company and have a lot of integrations with
> > > suppliers; product info is nearly always in XML files. I am thinking
> > > about using J as an analysis tool. Do you think that working with large
> > > files that need to be parsed SAX-style, without reading everything at
> > > once, is a good idea in J?
> > > Also, is this even advantageous (as in, would the code be terse)? Right
> > > now XML parsing is done in Golang, so if parsing in J is not very good
> > > we could try to rely more on CSV exports. CSV is definitely very good
> > > in J.
> > > I am hoping that maybe XML parsing is very good in J and the code would
> > > become much smaller; if this is the case, then I would think about
> > > using J for XMLs with new suppliers.
> > >
> > > Best Regards
> > > M.G.
> > > ----------------------------------------------------------------------
> > > For information about J forums see http://www.jsoftware.com/forums.htm


-- 
John D. Baker
[email protected]