Hi Hans, I will try that! Thank you.

By the way... I notice all APL non-ascii symbols seem to be replaced by
just '?' in my mailing list archive emails.. is this is an issue with my
mail provider, or is the UTF-8/APL/whatever encoding of mailing list posts
being lost in the daily Digests?

On Tue, 28 Dec 2021 at 13:15, <[email protected]> wrote:

> Send Bug-apl mailing list submissions to
>         [email protected]
>
> To subscribe or unsubscribe via the World Wide Web, visit
>         https://lists.gnu.org/mailman/listinfo/bug-apl
> or, via email, send a message with subject or body 'help' to
>         [email protected]
>
> You can reach the person managing the list at
>         [email protected]
>
> When replying, please edit your Subject line so it is more specific
> than "Re: Contents of Bug-apl digest..."
>
>
> Today's Topics:
>
>    1. Re: Absolute limits of rank 2 bool matrix size in GNU APL?
>       (Hans-Peter Sorge)
>    2. Re: Absolute limits of rank 2 bool matrix size in GNU APL?
>       (Dr. J?rgen Sauermann)
>
>
> ----------------------------------------------------------------------
>
> Message: 1
> Date: Tue, 28 Dec 2021 20:42:05 +0100
> From: Hans-Peter Sorge <[email protected]>
> To: [email protected]
> Subject: Re: Absolute limits of rank 2 bool matrix size in GNU APL?
> Message-ID: <[email protected]>
> Content-Type: text/plain; charset="utf-8"; Format="flowed"
>
> Hi Rus,
>
> looks like the outer product is a not needed - you have the unique words
> and along the line you got the word count too.
>
> you take the sorted word vector
> swv ?'aa' 'bb' 'bb' 'cc' 'cc' 'ff' 'gg'
>
> then you create a partition vector from it
> pv?+\1,~2?/swv
> pv
>
> 1 2 2 3 3 4 5
>
> partition for wc
>  ?????pv?pv
> 1 ?2 2 ?3 3 ?4 ?5
>
> Then the wc is
> wc???? pv ? pv
>  ?????wc
>
> and the unique words are
> uw?1?? pv ? swv
>  ?????uw
> aa bb cc ff gg
>
>
> finally the listing of occurrences
>  ??????uw,?wc
> aa 1
> bb 2
> cc 2
> ff 1
> gg 1
>
>
> Best Regards
> Hans-Peter
>
> Am 28.12.21 um 03:53 schrieb Russtopia:
> > Hi, doing some experiments in learning APL I was writing a word
> > frequency count program that takes in a document, identifies unique
> > words and then outputs the top 'N' occurring words.
> >
> > The most straightforward solution, to me, seems to be ?.? which works
> > up to a certain dataset size. The main limiting statement in my program
> is
> >
> > wordcounts?+? (wl ?.? uniqwords)
> >
> > .. which generates a large boolean array which is then tallied up for
> > each unique word.
> >
> > I seem to run into a limit in GNU APL. I do not see an obvious ?SYL
> > parameter to increase the limit and could not find any obvious
> > reference in the docs either. What are the absolute max rows/columns
> > of a matrix, and can the limit be increased? Are they separate or a
> > combined limit?
> >
> > 5 wcOuterProd 'corpus/135-0-5000.txt'? ? ?? 5000-line document
> > Time: 26419 ms
> > ? the ? of ? a and ?to
> > ?2646 1348 978 879 858
> > ? ? ? ?wl
> > 36564
> > ? ? ? ? uniqwords
> > 5695
> >
> > 5 wcOuterProd 'corpus/135-0-7500.txt'? ??? 7500-line document
> > WS FULL+
> > wcOuterProd[8] ?wordcounts?+?(wl?.?uniqwords)
> > ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ^ ? ? ? ? ? ? ^
> > ? ? ? ? wl
> > 58666
> > ? ? ? ? uniqwords
> > 7711
> >
> >
> > I have an iterative solution which doesn't use a boolean matrix to
> > count the words, rather looping through using pick/take and so can
> > handle much larger documents, but it takes roughy 2x the execution time.
> >
> > Relating to this, does GNU APL optimize boolean arrays to minimize
> > storage (ie., using larger bit vectors rather than entire ints per
> > bool) and is there any clever technique other experience APLers could
> > suggest to maintain the elegant 'loop-free' style of computing but
> > avoid generating such large bool matrices? I thought of perhaps a
> > hybrid approach where I iterate through portions of the data and do
> > partial ?.? passes but of course that complicates the algorithm.
> >
> > [my 'outer product' and 'iterative' versions of the code are below]
> >
> > Thanks,
> > -Russ
> >
> > ---
> > #!/usr/local/bin/apl --script
> > ?????????????????????????????????????????????????????????????????????
> > ? ? ? ? ? ??
> > ? wordcount.apl ? ? ? ? ? ? ? ? ? ? ? ?2021-12-26 ?20:07:07 (GMT-8) ??
> > ? ? ? ? ? ??
> > ?????????????????????????????????????????????????????????????????????
> >
> > ? function edif has ufun1 pointer 0!
> >
> > ?r ? timeMs; t
> > ? t ? ?TS
> > ? r ? (((t[3]?86400)+(t[4]?3600)+(t[5]?60)+(t[6]))?1000)+t[7]
> > ?
> >
> > ?r ? lowerAndStrip s;stripped;mixedCase
> > ?stripped ? ' abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz*'
> > ?mixedCase ? ?av[11],'
> > ,.?!;:"''()[]-ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'
> > ?r ? stripped[mixedCase ? s]
> > ?
> >
> > ?c ? h wcIterative fname
> > ? ??;D;WL;idx;len;word;wc;wcl;idx
> > ? ?? Return ?-sorted count of unique words in string vector D,
> > ignoring case and punctuation
> > ? ?? @param h(?) - how many top word counts to return
> > ? ?? @param D(?) - vector of words
> > ? ????
> > ? D ? lowerAndStrip (?fio['read_file'] fname) ?? raw text with newlines
> > ? timeStart ? timeMs
> > ? D ? (~ D ? ' ') ? D ? make into a vector of words
> > ? WL ? ?D
> > ? ??#DEBUG# ? ? 'unique words:',WL
> > ? wcl ? 0?0
> > ? idx ? 1
> > ? len ? ?WL
> > count:
> > ? ??#DEBUG# ? ? idx
> > ? ?(idx>len)/done
> > ? word ? ?idx?WL
> > ? ??#DEBUG# ? ? word
> > ? wc ? +/(word??D)
> > ? wcl ? wcl,wc
> > ? ??#DEBUG# ? ? wcl
> > ? idx ? 1+idx
> > ? ? count
> > done:
> > ? c ? h?[2] (WL)[?wcl],[0.5]wcl[?wcl]
> > ? timeEnd ? timeMs
> > ? ? ? 'Time:',(timeEnd-timeStart),'ms'
> > ?
> >
> > ?r ? n wcOuterProd fname
> > ? ?? ;D;wl;uniqwords;wordcounts;sortOrder
> > ? D ? lowerAndStrip (?fio['read_file'] fname) ?? raw text with newlines
> > ? timeStart ? timeMs
> > ? wl ? (~ D ? ' ') ? D
> > ? ??#DEBUG# ? ? '? wl:', ? wl
> > ? uniqwords ? ?wl
> > ? ??#DEBUG# ? ? '? uniqwords:', ? uniqwords
> > ? wordcounts ? +?(wl ?.? uniqwords)
> > ? sortOrder ? ?wordcounts
> > ? r ? n?[2] uniqwords[sortOrder],[0.5]wordcounts[sortOrder]
> > ? timeEnd ? timeMs
> > ? ? ? 'Time:',(timeEnd-timeStart),'ms'
> > ?
> >
> >
>
> -------------- next part --------------
> An HTML attachment was scrubbed...
> URL: <
> https://lists.gnu.org/archive/html/bug-apl/attachments/20211228/f246993a/attachment.htm
> >
>
> ------------------------------
>
> Message: 2
> Date: Tue, 28 Dec 2021 22:15:44 +0100
> From: Dr. J?rgen Sauermann <[email protected]>
> To: Blake McBride <[email protected]>, Elias M?rtenson
>         <[email protected]>
> Cc: bug-apl <[email protected]>, Russtopia <[email protected]>
> Subject: Re: Absolute limits of rank 2 bool matrix size in GNU APL?
> Message-ID:
>         <[email protected]>
> Content-Type: text/plain; charset=utf-8; format=flowed
>
> Hi,
>
> I have spent many hours to do a proper memory handling. The current
> solution is not perfect but will work
> in many cases. If anybody has suggestions how to improve then I'll be
> happy to use them.
>
> The current approach is roughly this (if I remember correctly)
>
> 1. if the user specifies a *ulimit -v* other than *unlimited* then use
> it and assume it will be available
> at any time. Otherwise assume a limit of 2 GB memory.
> 2. decrease the limit a little so that C++ exception handlers can
> allocate their memory.
> 3. during apl execution, keep track of the amunt of memory used and
> released by *operator new
> *(= essentially malloc), which is roughly the physical memory used by apl.
> 4. When the limit is reached, raise WS FULL.
>
> Before that, the apl process was sometimes killed (un-catchable) before
> it could raise WS FULL,
> which is IMHO worse than raising WS FULL a little too early (i.e. before
> the memory is really
> exhausted. The current approach means in Blake's terms, that we try our
> best to stay at Level 1.
> You can actually go directly from Level 1 to Level 3 by creating a large
> apl value, so any solution
> for only one level does not really help.
>
> As far as I can see, the free utility is just a nicer front-end to
> */proc/meminfo*.
>
> We cannot check */proc/meminfo* before every memory allocation, so we do
> it at start-up and trust
> our own memory bookkeeping. There is a little slack caused by malloc,
> but our safety-margin 2.
> above handles that.
>
> I tried earlier to depend on paging, but I was running into problems
> (crashes instead of paging
> even though the paging space was huge enough).
>
> Also, conceptually *?WA *becomes useless if it can change without
> anything happening in apl
> (e.g, when other processes use memory). In theory we could not only
> allocate but also
> use all memory beforehand, but that is not very practical if it is not
> used. In the old days,
> a process would know how much memory it has, but nowadays it does not.
>
> Best Regards,
> J?rgen
>
>
> On 12/28/21 8:25 PM, Blake McBride wrote:
> > I think what you're talking about and what I am talking about are not
> > exactly the same thing.? I can see what you are saying, but that is a
> > contrived example.? For example:
> >
> > Level 1:? you are using the RAM that exists (not over-committed)
> >
> > Level 2:? you are using more RAM than you have causing over-commit and
> > paging.
> >
> > Level 3:? you allocate more memory than exists in RAM?+ paging.
> >
> > I am talking about Levels 1 and 2.? You are talking about Level 3.? At
> > least in my world, I have never hit Level 3 because Level 2 becomes
> > intolerable slow and I kill the process myself.? For this reason, in
> > my world, Level 3 never occurs.
> >
> > It doesn't make sense to handle Level 2 problems with Level 3 logic.
> >
> > Blake
> >
> >
> > On Tue, Dec 28, 2021 at 1:16 PM Elias M?rtenson <[email protected]
> > <mailto:[email protected]>> wrote:
> >
> >     Actually, the default memory allocator in glibc uses mmap and not
> >     sbrk, if I'm not mistaken. However, they both do the same thing at
> >     the end of the day, and it's to request more pages from the kernel.
> >
> >     Unfortunately, Linux malloc never returns NULL. Even if you try to
> >     allocate a petabyte in one allocation. You can write to this
> >     memory and new pages will be created as you write, and at some
> >     point your write will fail with a SEGV because there are no free
> >     page left.
> >
> >     What's even worse is that once you start to fail, the kernel will
> >     randomly kill processes until it gets more free pages. If that
> >     sounds idiotic, that's because it is, but that's how it works.
> >
> >     You can configure the kernel to use not do this, but unfortunately
> >     you're going to have other problems if you do, primarily because
> >     all Linux software is written with the default behaviour in mind.
> >
> >     You can try it right now. Write a C program that allocates a few
> >     TB of RAM and see what happens. Actually, nothing will happen
> >     until you start writing to this memory.
> >
> >     Den ons 29 dec. 2021 03:09Blake McBride <[email protected]
> >     <mailto:[email protected]>> skrev:
> >
> >         Hi J?rgen,
> >
> >         First, to be sure you know where I'm coming from.? I know
> >         malloc() and sbrk() pretty well.? I've written malloc's?before.
> >
> >         It is utter news to me that malloc would return a pointer that
> >         you can't use - even in over-commit?situations.?
> >         Theoretically, in over-commit situations, if you access a
> >         pointer that's paged out, the system should page it in first.?
> >         If that doesn't happen, that's a bug in the OS.? I'd create a
> >         simple program that mimics this behavior and submit it as a
> >         bug report.? (What I almost always find is that during the
> >         process of creating the sample program, I find the actual bug
> >         in my own program.)
> >
> >         What I'd try to do is write my own malloc that does the
> following:
> >
> >         1. Checks for the amount of real memory available using
> >         something I took from the /free/ command.
> >
> >         2. If there is enough memory, call the real malloc and return it.
> >
> >         3. If there is not enough real memory, rather than calling
> >         malloc and causing paging you can just return a null.
> >
> >         Of course, I understand that sbrk() adjusts the system
> >         breakpoint essentially?giving you more heap space. Malloc
> >         manages that free space that's already been taken from the OS
> >         via sbrk().
> >
> >         With regard to item #1, you'd have to have an intelligent way
> >         of determining when to calculate the true available space.?
> >         You can't do it for 16 bytes, and you don't want to do it for
> >         every allocation.? On the other hand for large allocations or
> >         if you haven't queried the system for some time, it makes
> >         sense to update your opinion of the available space.
> >
> >         Thanks.
> >
> >         Blake
> >
> >
> >
> >
> >         On Tue, Dec 28, 2021 at 12:50 PM Dr. J?rgen Sauermann
> >         <mail@j?rgen-sauermann.de
> >         <mailto:mail@j%C3%BCrgen-sauermann.de>> wrote:
> >
> >             Hi Blake,
> >
> >             I know. The problem is that other processes may, just as
> >             GNU APL does, allocate memory
> >             successfully via malloc(). malloc, on "success" returns a
> >             non-zero virtual memory address.
> >
> >             The fact that malloc returns a non-zero address (=
> >             indicating malloc() success) does unfortunately
> >             not guarantee that you would be able to access that
> >             memory. You may or may
> >             not get a page fault when accessing the memory. The point
> >             in time when you call 'free' or
> >             similar tools (which basically read /proc/memory tell you
> >             the situation at that time which can
> >             be very different from the situation when you fill your
> >             virtual memory with real data (which
> >             assigns a physical memory page to the successfully
> >             allocated virtual memory page) and
> >             that may fail. Even worse, when this happen then your
> >             machine has no more memory and
> >             the C++ exception handlers will also fails due to lack of
> >             memory and the process is terminated
> >             without any chance to raise a WS full or even remain in apl.
> >
> >             Best Regards,
> >             J?rgen
> >
> >
> >             On 12/28/21 6:52 PM, Blake McBride wrote:
> >>             Since the Linux "free" command knows the difference
> >>             between how much real memory vs. virtual memory is
> >>             available, it may be useful to use a similar method.
> >>
> >>             Blake
> >>
> >>
> >>             On Tue, Dec 28, 2021 at 11:34 AM Dr. J?rgen Sauermann
> >>             <mail@j?rgen-sauermann.de
> >>             <mailto:mail@j%C3%BCrgen-sauermann.de>> wrote:
> >>
> >>                 Hi Russ,
> >>
> >>                 it has turned out to be very difficult to find a
> >>                 reliable way to figure the memory that is available for
> >>                 a process like the GNU APL interpreter. One (of
> >>                 several) reasons for that is that GNU linux tells you
> >>                 that it has more memory than it really has (so-called
> >>                 over-commitment). You can malloc() more memory
> >>                 than you have and malloc will return a virtual memory
> >>                 (and no error) when you ask for it. However, if you
> >>                 access the virtual memory at a later point in time
> >>                 (i.e. requesting real memory for it) then the process
> >>                 may crash badly if that real memory is not available.
> >>
> >>                 GNU APL deals with this by assuming (by default) that
> >>                 the total memory available for the GNU APL process is
> >>                 about 2 GB,
> >>
> >>                 You can increase that amount (outside apl) be setting
> >>                 a larger memory limit, probably with:
> >>
> >>                 *ulimit --virtual-memory-size*or *ulimit -v
> >>                 *(depending on platform).
> >>
> >>                 in the shell or script before starting apl. However,
> >>                 note that:
> >>
> >>                 * *ulimit -v* *ulimited* will not work because this
> >>                 is the default and GNU APL will apply its default of
> >>                 2 GB in this case,
> >>                 * the WS FULL behaviour of GNU APL (ans ?WA) will
> >>                 become unreliable. You must ensure that GNU APL will get
> >>                 ? as much memory as you promised it with ulimit, and
> >>                 * all GNU APL cells have the same size (e.g. 24 byte
> >>                 on a 64-bit CPU, see *apl -l37*) even if the cells
> >>                 contain only
> >>                 ?Booleans.
> >>
> >>                 The workaround for really large Boolean arrays is to
> >>                 pack them into the 64-bits of a GNU APL integer
> >>                 and maybe use the bitwise functions (??, ??, ...) of
> >>                 GNU APL to access them group-wise.
> >>
> >>                 Best Regards,
> >>                 J?rgen
> >>
> >>
> >>                 On 12/28/21 3:53 AM, Russtopia wrote:
> >>>                 Hi, doing some experiments in learning APL I was
> >>>                 writing a word frequency count program that takes in
> >>>                 a document, identifies unique words and then outputs
> >>>                 the top 'N' occurring words.
> >>>
> >>>                 The most straightforward solution, to me, seems to
> >>>                 be ?.? which works up to a certain dataset size. The
> >>>                 main limiting statement in my program is
> >>>
> >>>                 wordcounts?+? (wl ?.? uniqwords)
> >>>
> >>>                 .. which generates a large boolean array which is
> >>>                 then tallied up for each unique word.
> >>>
> >>>                 I seem to run into a limit in GNU APL. I do not see
> >>>                 an obvious ?SYL parameter to increase the limit and
> >>>                 could not find any obvious reference in the docs
> >>>                 either. What are the absolute max rows/columns of a
> >>>                 matrix, and can the limit be increased? Are they
> >>>                 separate or a combined limit?
> >>>
> >>>                 ? ? ? 5 wcOuterProd 'corpus/135-0-5000.txt'? ? ??
> >>>                 5000-line document
> >>>                 Time: 26419 ms
> >>>                 ? the ? of ? a and ?to
> >>>                 ?2646 1348 978 879 858
> >>>                 ? ? ? ?wl
> >>>                 36564
> >>>                 ? ? ? ? uniqwords
> >>>                 5695
> >>>
> >>>                 ? ? ? 5 wcOuterProd 'corpus/135-0-7500.txt'? ???
> >>>                 7500-line document
> >>>                 WS FULL+
> >>>                 wcOuterProd[8] ?wordcounts?+?(wl?.?uniqwords)
> >>>                 ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ^ ? ? ? ? ? ^
> >>>                 ? ? ? ? wl
> >>>                 58666
> >>>                 ? ? ? ? uniqwords
> >>>                 7711
> >>>
> >>>
> >>>                 I have an iterative solution which doesn't use a
> >>>                 boolean matrix to count the words, rather looping
> >>>                 through using pick/take and so can handle much
> >>>                 larger documents, but it takes roughy 2x the
> >>>                 execution time.
> >>>
> >>>                 Relating to this, does GNU APL optimize boolean
> >>>                 arrays to minimize storage (ie., using larger bit
> >>>                 vectors rather than entire ints per bool) and is
> >>>                 there any clever technique other experience APLers
> >>>                 could suggest to maintain the elegant 'loop-free'
> >>>                 style of computing but avoid generating such large
> >>>                 bool matrices? I thought of perhaps a hybrid
> >>>                 approach where I iterate through portions of the
> >>>                 data and do partial ?.? passes but of course that
> >>>                 complicates the algorithm.
> >>>
> >>>                 [my 'outer product' and 'iterative' versions of the
> >>>                 code are below]
> >>>
> >>>                 Thanks,
> >>>                 -Russ
> >>>
> >>>                 ---
> >>>                 #!/usr/local/bin/apl --script
> >>>
>  ?????????????????????????????????????????????????????????????????????
> >>>                 ? ??
> >>>                 ? wordcount.apl ? ? ?2021-12-26 ?20:07:07 (GMT-8) ??
> >>>                 ? ??
> >>>
>  ?????????????????????????????????????????????????????????????????????
> >>>
> >>>                 ? function edif has ufun1 pointer 0!
> >>>
> >>>                 ?r ? timeMs; t
> >>>                 ? t ? ?TS
> >>>                 ? r ?
> >>>                 (((t[3]?86400)+(t[4]?3600)+(t[5]?60)+(t[6]))?1000)+t[7]
> >>>                 ?
> >>>
> >>>                 ?r ? lowerAndStrip s;stripped;mixedCase
> >>>                 ?stripped ? '
> >>>                 abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz*'
> >>>                 ?mixedCase ? ?av[11],'
> >>>
>  ,.?!;:"''()[]-ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'
> >>>                 ?r ? stripped[mixedCase ? s]
> >>>                 ?
> >>>
> >>>                 ?c ? h wcIterative fname
> >>>                 ? ??;D;WL;idx;len;word;wc;wcl;idx
> >>>                 ? ?? Return ?-sorted count of unique words in string
> >>>                 vector D, ignoring case and punctuation
> >>>                 ? ?? @param h(?) - how many top word counts to return
> >>>                 ? ?? @param D(?) - vector of words
> >>>                 ? ????
> >>>                 ? D ? lowerAndStrip (?fio['read_file'] fname) ?? raw
> >>>                 text with newlines
> >>>                 ? timeStart ? timeMs
> >>>                 ? D ? (~ D ? ' ') ? D ? make into a vector of words
> >>>                 ? WL ? ?D
> >>>                 ? ??#DEBUG# ? ? 'unique words:',WL
> >>>                 ? wcl ? 0?0
> >>>                 ? idx ? 1
> >>>                 ? len ? ?WL
> >>>                 count:
> >>>                 ? ??#DEBUG# ? ? idx
> >>>                 ? ?(idx>len)/done
> >>>                 ? word ? ?idx?WL
> >>>                 ? ??#DEBUG# ? ? word
> >>>                 ? wc ? +/(word??D)
> >>>                 ? wcl ? wcl,wc
> >>>                 ? ??#DEBUG# ? ? wcl
> >>>                 ? idx ? 1+idx
> >>>                 ? ? count
> >>>                 done:
> >>>                 ? c ? h?[2] (WL)[?wcl],[0.5]wcl[?wcl]
> >>>                 ? timeEnd ? timeMs
> >>>                 ? ? ? 'Time:',(timeEnd-timeStart),'ms'
> >>>                 ?
> >>>
> >>>                 ?r ? n wcOuterProd fname
> >>>                 ? ?? ;D;wl;uniqwords;wordcounts;sortOrder
> >>>                 ? D ? lowerAndStrip (?fio['read_file'] fname) ?? raw
> >>>                 text with newlines
> >>>                 ? timeStart ? timeMs
> >>>                 ? wl ? (~ D ? ' ') ? D
> >>>                 ? ??#DEBUG# ? ? '? wl:', ? wl
> >>>                 ? uniqwords ? ?wl
> >>>                 ? ??#DEBUG# ? ? '? uniqwords:', ? uniqwords
> >>>                 ? wordcounts ? +?(wl ?.? uniqwords)
> >>>                 ? sortOrder ? ?wordcounts
> >>>                 ? r ? n?[2]
> >>>                 uniqwords[sortOrder],[0.5]wordcounts[sortOrder]
> >>>                 ? timeEnd ? timeMs
> >>>                 ? ? ? 'Time:',(timeEnd-timeStart),'ms'
> >>>                 ?
> >>>
> >>>
> >>
> >
>
>
>
>
> ------------------------------
>
> Subject: Digest Footer
>
> _______________________________________________
> Bug-apl mailing list
> [email protected]
> https://lists.gnu.org/mailman/listinfo/bug-apl
>
>
> ------------------------------
>
> End of Bug-apl Digest, Vol 99, Issue 13
> ***************************************
>

Reply via email to