Re: [Bug-apl] Performance problems when constructing large(ish) arrays

Elias Mårtenson Tue, 17 Jan 2017 11:58:56 -0800

You are right that doing array allocations in any language is slow, and in
order to get some kind of baseline to compare things to, I implemented the
same algorithm in Lisp, and I did so in the worst possible way: By
completely reallocating the array for each row, and copying the old content
into the new one (code in the end of this message). I did it that way
because that is what the APL code ends up doing, so I didn't want to give
the Lisp program any benefits.This was, as you may imagine, very slow. It
took *51* seconds to load the entire file.


The APL version that I posted in my previous message takes

Now, for a solution. Reading the file twice is quite ugly, and I'd rather
avoid doing so.

I could rewrite the Lisp program to in-place reallocation of the array,
which would speed up the program by several orders of magnitude. The
problem is that there is no mechanism in GNU APL to do the same. I think
adding a new construct to the language to do that would be ugly, but
perhaps there is a way for GNU APL to detect the specific “append to the
end of an array” idiom so that it could use an optimised path that doesn't
require constructing a new array. *Jürgen*, is that even feasible?

Now, reading a CSV file like this is a very common case, and I am of the
opinion that having a dedicated native function for this purpose would
solve perhaps 75% of use cases where this is an issue. I'll be happy to
implement a flexible function that can do this, which would make GNU APL
much more pleasant to use for data analysis.

Jürgen, if I were to implement this, would you prefer it to be a
dynamically loaded library, or would you prefer having it as a
quad-function?

Regards,
Elias

On 18 January 2017 at 02:33, Nick Lobachevsky <ibeam2...@gmail.com> wrote:

> Elias,
>
> In just about any language, iterative string catenation is a real
> performance killer owing to the repeated growing memory allocation
> which just gets bigger with every iteration.
>
> I would restructure the algorithm to work more like this:  (sorry, on
> a Windoze box and no APL, so metacode only)
>
> Open the file, read the contents of the entire file, close the file
> Change NL to separator, then partition the whole thing (Z)
> Check to see that the length of the partitioned string divides evenly
> by pattern length
> Create a temporary of only the numeric items.  As they should all be
> scalar, you should be able to do it in a single execute and get a
> numeric vector of the correct length.
> Assign the numeric vector back into Z.  You may want to do a ravel
> each if you would rather have one element vectors
> Reshape Z to have the same number columns as in pattern
>
> Dyalog and some other APLs (APLX?) have a []CSV function.  The problem
> gets a bit harder when you have optional quotes around strings,
> embedded separators, and have to figure out yourself whether something
> is numeric or char.
>
>
>
> On 1/17/17, Elias Mårtenson <loke...@gmail.com> wrote:
> > I wanted to use GNU APL to work on a dataset of star data. The file
> > consists of 34030 lines of the following form:
> >
> >   892376 3813 4.47 0.4699  1.532  0.007    7306.69 0.823 0.4503 0 ---
> >  1026146 4261 4.57 0.6472 14.891  0.12    11742.56 1.405 0.7229 0 ---
> >  1026474 4122 4.56 0.5914  1.569  0.006   30471.8  1.204 0.6061 0 ---
> >  1162635 3760 4.77 0.4497 15.678  0.019   10207.47 0.978 0.5445 1 ---
> >
> > I wrote a generic CSV loader to handle this (source code at the end of
> this
> > email), and loaded the data like so:
> >
> > *    z ← 'nnnnnnnnnns' read_csv 'apjs492452t1_mrt.txt'*
> >
> > This took many minutes to load, which in my opinion shouldn't happen.
> >
> > Now, I have a few questions:
> >
> >
> >    1. Is there a way to speed up this code?
> >    2. Is there something that could be done on the GNU APL implementation
> >    side to make this faster?
> >    3. Shouldn't we have a generic ⎕CSV function or something like that
> >    which would be able to load CSV files in milliseconds regardless of
> > size?
> >    This should be trivial to do in C++.
> >
> > Here's the code in question:
> >
> > ∇Z ← type convert_entry value
> >   →('n'≡type)/numeric
> >   →('s'≡type)/string
> >   ⎕ES 'Illegal conversion type'
> > numeric:
> >   Z←⍎value
> >   →end
> > string:
> >   Z←value
> > end:
> > ∇
> >
> > ∇Z ← pattern read_csv filename ;fd;line;separator
> >   separator ← ' '
> >   Z ← 0 (↑⍴pattern) ⍴ ⍬
> >   fd ← 'r' FIO∆fopen filename
> >
> > next:
> >   line ← FIO∆fgets fd           ⍝ Read one line from the file
> >   →(⍬≡line)/end
> >   →(10≠line[⍴line])/skip_nl     ⍝ If the line ends in a newline
> >   line ← line[⍳¯1+⍴line]        ⍝ Remove the newline
> > skip_nl:
> >   line ← ⎕UCS line
> >   Z ← Z⍪ pattern convert_entry¨ (line≠separator) ⊂ line
> >   →next
> > end:
> >
> >   FIO∆fclose fd
> > ∇
> >
>

Re: [Bug-apl] Performance problems when constructing large(ish) arrays

Reply via email to