Hi Hans, I will try that! Thank you. By the way... I notice all APL non-ascii symbols seem to be replaced by just '?' in my mailing list archive emails.. is this is an issue with my mail provider, or is the UTF-8/APL/whatever encoding of mailing list posts being lost in the daily Digests?
On Tue, 28 Dec 2021 at 13:15, <[email protected]> wrote: > Send Bug-apl mailing list submissions to > [email protected] > > To subscribe or unsubscribe via the World Wide Web, visit > https://lists.gnu.org/mailman/listinfo/bug-apl > or, via email, send a message with subject or body 'help' to > [email protected] > > You can reach the person managing the list at > [email protected] > > When replying, please edit your Subject line so it is more specific > than "Re: Contents of Bug-apl digest..." > > > Today's Topics: > > 1. Re: Absolute limits of rank 2 bool matrix size in GNU APL? > (Hans-Peter Sorge) > 2. Re: Absolute limits of rank 2 bool matrix size in GNU APL? > (Dr. J?rgen Sauermann) > > > ---------------------------------------------------------------------- > > Message: 1 > Date: Tue, 28 Dec 2021 20:42:05 +0100 > From: Hans-Peter Sorge <[email protected]> > To: [email protected] > Subject: Re: Absolute limits of rank 2 bool matrix size in GNU APL? > Message-ID: <[email protected]> > Content-Type: text/plain; charset="utf-8"; Format="flowed" > > Hi Rus, > > looks like the outer product is a not needed - you have the unique words > and along the line you got the word count too. > > you take the sorted word vector > swv ?'aa' 'bb' 'bb' 'cc' 'cc' 'ff' 'gg' > > then you create a partition vector from it > pv?+\1,~2?/swv > pv > > 1 2 2 3 3 4 5 > > partition for wc > ?????pv?pv > 1 ?2 2 ?3 3 ?4 ?5 > > Then the wc is > wc???? pv ? pv > ?????wc > > and the unique words are > uw?1?? pv ? swv > ?????uw > aa bb cc ff gg > > > finally the listing of occurrences > ??????uw,?wc > aa 1 > bb 2 > cc 2 > ff 1 > gg 1 > > > Best Regards > Hans-Peter > > Am 28.12.21 um 03:53 schrieb Russtopia: > > Hi, doing some experiments in learning APL I was writing a word > > frequency count program that takes in a document, identifies unique > > words and then outputs the top 'N' occurring words. > > > > The most straightforward solution, to me, seems to be ?.? which works > > up to a certain dataset size. The main limiting statement in my program > is > > > > wordcounts?+? (wl ?.? uniqwords) > > > > .. which generates a large boolean array which is then tallied up for > > each unique word. > > > > I seem to run into a limit in GNU APL. I do not see an obvious ?SYL > > parameter to increase the limit and could not find any obvious > > reference in the docs either. What are the absolute max rows/columns > > of a matrix, and can the limit be increased? Are they separate or a > > combined limit? > > > > 5 wcOuterProd 'corpus/135-0-5000.txt'? ? ?? 5000-line document > > Time: 26419 ms > > ? the ? of ? a and ?to > > ?2646 1348 978 879 858 > > ? ? ? ?wl > > 36564 > > ? ? ? ? uniqwords > > 5695 > > > > 5 wcOuterProd 'corpus/135-0-7500.txt'? ??? 7500-line document > > WS FULL+ > > wcOuterProd[8] ?wordcounts?+?(wl?.?uniqwords) > > ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ^ ? ? ? ? ? ? ^ > > ? ? ? ? wl > > 58666 > > ? ? ? ? uniqwords > > 7711 > > > > > > I have an iterative solution which doesn't use a boolean matrix to > > count the words, rather looping through using pick/take and so can > > handle much larger documents, but it takes roughy 2x the execution time. > > > > Relating to this, does GNU APL optimize boolean arrays to minimize > > storage (ie., using larger bit vectors rather than entire ints per > > bool) and is there any clever technique other experience APLers could > > suggest to maintain the elegant 'loop-free' style of computing but > > avoid generating such large bool matrices? I thought of perhaps a > > hybrid approach where I iterate through portions of the data and do > > partial ?.? passes but of course that complicates the algorithm. > > > > [my 'outer product' and 'iterative' versions of the code are below] > > > > Thanks, > > -Russ > > > > --- > > #!/usr/local/bin/apl --script > > ????????????????????????????????????????????????????????????????????? > > ? ? ? ? ? ?? > > ? wordcount.apl ? ? ? ? ? ? ? ? ? ? ? ?2021-12-26 ?20:07:07 (GMT-8) ?? > > ? ? ? ? ? ?? > > ????????????????????????????????????????????????????????????????????? > > > > ? function edif has ufun1 pointer 0! > > > > ?r ? timeMs; t > > ? t ? ?TS > > ? r ? (((t[3]?86400)+(t[4]?3600)+(t[5]?60)+(t[6]))?1000)+t[7] > > ? > > > > ?r ? lowerAndStrip s;stripped;mixedCase > > ?stripped ? ' abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz*' > > ?mixedCase ? ?av[11],' > > ,.?!;:"''()[]-ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz' > > ?r ? stripped[mixedCase ? s] > > ? > > > > ?c ? h wcIterative fname > > ? ??;D;WL;idx;len;word;wc;wcl;idx > > ? ?? Return ?-sorted count of unique words in string vector D, > > ignoring case and punctuation > > ? ?? @param h(?) - how many top word counts to return > > ? ?? @param D(?) - vector of words > > ? ???? > > ? D ? lowerAndStrip (?fio['read_file'] fname) ?? raw text with newlines > > ? timeStart ? timeMs > > ? D ? (~ D ? ' ') ? D ? make into a vector of words > > ? WL ? ?D > > ? ??#DEBUG# ? ? 'unique words:',WL > > ? wcl ? 0?0 > > ? idx ? 1 > > ? len ? ?WL > > count: > > ? ??#DEBUG# ? ? idx > > ? ?(idx>len)/done > > ? word ? ?idx?WL > > ? ??#DEBUG# ? ? word > > ? wc ? +/(word??D) > > ? wcl ? wcl,wc > > ? ??#DEBUG# ? ? wcl > > ? idx ? 1+idx > > ? ? count > > done: > > ? c ? h?[2] (WL)[?wcl],[0.5]wcl[?wcl] > > ? timeEnd ? timeMs > > ? ? ? 'Time:',(timeEnd-timeStart),'ms' > > ? > > > > ?r ? n wcOuterProd fname > > ? ?? ;D;wl;uniqwords;wordcounts;sortOrder > > ? D ? lowerAndStrip (?fio['read_file'] fname) ?? raw text with newlines > > ? timeStart ? timeMs > > ? wl ? (~ D ? ' ') ? D > > ? ??#DEBUG# ? ? '? wl:', ? wl > > ? uniqwords ? ?wl > > ? ??#DEBUG# ? ? '? uniqwords:', ? uniqwords > > ? wordcounts ? +?(wl ?.? uniqwords) > > ? sortOrder ? ?wordcounts > > ? r ? n?[2] uniqwords[sortOrder],[0.5]wordcounts[sortOrder] > > ? timeEnd ? timeMs > > ? ? ? 'Time:',(timeEnd-timeStart),'ms' > > ? > > > > > > -------------- next part -------------- > An HTML attachment was scrubbed... > URL: < > https://lists.gnu.org/archive/html/bug-apl/attachments/20211228/f246993a/attachment.htm > > > > ------------------------------ > > Message: 2 > Date: Tue, 28 Dec 2021 22:15:44 +0100 > From: Dr. J?rgen Sauermann <[email protected]> > To: Blake McBride <[email protected]>, Elias M?rtenson > <[email protected]> > Cc: bug-apl <[email protected]>, Russtopia <[email protected]> > Subject: Re: Absolute limits of rank 2 bool matrix size in GNU APL? > Message-ID: > <[email protected]> > Content-Type: text/plain; charset=utf-8; format=flowed > > Hi, > > I have spent many hours to do a proper memory handling. The current > solution is not perfect but will work > in many cases. If anybody has suggestions how to improve then I'll be > happy to use them. > > The current approach is roughly this (if I remember correctly) > > 1. if the user specifies a *ulimit -v* other than *unlimited* then use > it and assume it will be available > at any time. Otherwise assume a limit of 2 GB memory. > 2. decrease the limit a little so that C++ exception handlers can > allocate their memory. > 3. during apl execution, keep track of the amunt of memory used and > released by *operator new > *(= essentially malloc), which is roughly the physical memory used by apl. > 4. When the limit is reached, raise WS FULL. > > Before that, the apl process was sometimes killed (un-catchable) before > it could raise WS FULL, > which is IMHO worse than raising WS FULL a little too early (i.e. before > the memory is really > exhausted. The current approach means in Blake's terms, that we try our > best to stay at Level 1. > You can actually go directly from Level 1 to Level 3 by creating a large > apl value, so any solution > for only one level does not really help. > > As far as I can see, the free utility is just a nicer front-end to > */proc/meminfo*. > > We cannot check */proc/meminfo* before every memory allocation, so we do > it at start-up and trust > our own memory bookkeeping. There is a little slack caused by malloc, > but our safety-margin 2. > above handles that. > > I tried earlier to depend on paging, but I was running into problems > (crashes instead of paging > even though the paging space was huge enough). > > Also, conceptually *?WA *becomes useless if it can change without > anything happening in apl > (e.g, when other processes use memory). In theory we could not only > allocate but also > use all memory beforehand, but that is not very practical if it is not > used. In the old days, > a process would know how much memory it has, but nowadays it does not. > > Best Regards, > J?rgen > > > On 12/28/21 8:25 PM, Blake McBride wrote: > > I think what you're talking about and what I am talking about are not > > exactly the same thing.? I can see what you are saying, but that is a > > contrived example.? For example: > > > > Level 1:? you are using the RAM that exists (not over-committed) > > > > Level 2:? you are using more RAM than you have causing over-commit and > > paging. > > > > Level 3:? you allocate more memory than exists in RAM?+ paging. > > > > I am talking about Levels 1 and 2.? You are talking about Level 3.? At > > least in my world, I have never hit Level 3 because Level 2 becomes > > intolerable slow and I kill the process myself.? For this reason, in > > my world, Level 3 never occurs. > > > > It doesn't make sense to handle Level 2 problems with Level 3 logic. > > > > Blake > > > > > > On Tue, Dec 28, 2021 at 1:16 PM Elias M?rtenson <[email protected] > > <mailto:[email protected]>> wrote: > > > > Actually, the default memory allocator in glibc uses mmap and not > > sbrk, if I'm not mistaken. However, they both do the same thing at > > the end of the day, and it's to request more pages from the kernel. > > > > Unfortunately, Linux malloc never returns NULL. Even if you try to > > allocate a petabyte in one allocation. You can write to this > > memory and new pages will be created as you write, and at some > > point your write will fail with a SEGV because there are no free > > page left. > > > > What's even worse is that once you start to fail, the kernel will > > randomly kill processes until it gets more free pages. If that > > sounds idiotic, that's because it is, but that's how it works. > > > > You can configure the kernel to use not do this, but unfortunately > > you're going to have other problems if you do, primarily because > > all Linux software is written with the default behaviour in mind. > > > > You can try it right now. Write a C program that allocates a few > > TB of RAM and see what happens. Actually, nothing will happen > > until you start writing to this memory. > > > > Den ons 29 dec. 2021 03:09Blake McBride <[email protected] > > <mailto:[email protected]>> skrev: > > > > Hi J?rgen, > > > > First, to be sure you know where I'm coming from.? I know > > malloc() and sbrk() pretty well.? I've written malloc's?before. > > > > It is utter news to me that malloc would return a pointer that > > you can't use - even in over-commit?situations.? > > Theoretically, in over-commit situations, if you access a > > pointer that's paged out, the system should page it in first.? > > If that doesn't happen, that's a bug in the OS.? I'd create a > > simple program that mimics this behavior and submit it as a > > bug report.? (What I almost always find is that during the > > process of creating the sample program, I find the actual bug > > in my own program.) > > > > What I'd try to do is write my own malloc that does the > following: > > > > 1. Checks for the amount of real memory available using > > something I took from the /free/ command. > > > > 2. If there is enough memory, call the real malloc and return it. > > > > 3. If there is not enough real memory, rather than calling > > malloc and causing paging you can just return a null. > > > > Of course, I understand that sbrk() adjusts the system > > breakpoint essentially?giving you more heap space. Malloc > > manages that free space that's already been taken from the OS > > via sbrk(). > > > > With regard to item #1, you'd have to have an intelligent way > > of determining when to calculate the true available space.? > > You can't do it for 16 bytes, and you don't want to do it for > > every allocation.? On the other hand for large allocations or > > if you haven't queried the system for some time, it makes > > sense to update your opinion of the available space. > > > > Thanks. > > > > Blake > > > > > > > > > > On Tue, Dec 28, 2021 at 12:50 PM Dr. J?rgen Sauermann > > <mail@j?rgen-sauermann.de > > <mailto:mail@j%C3%BCrgen-sauermann.de>> wrote: > > > > Hi Blake, > > > > I know. The problem is that other processes may, just as > > GNU APL does, allocate memory > > successfully via malloc(). malloc, on "success" returns a > > non-zero virtual memory address. > > > > The fact that malloc returns a non-zero address (= > > indicating malloc() success) does unfortunately > > not guarantee that you would be able to access that > > memory. You may or may > > not get a page fault when accessing the memory. The point > > in time when you call 'free' or > > similar tools (which basically read /proc/memory tell you > > the situation at that time which can > > be very different from the situation when you fill your > > virtual memory with real data (which > > assigns a physical memory page to the successfully > > allocated virtual memory page) and > > that may fail. Even worse, when this happen then your > > machine has no more memory and > > the C++ exception handlers will also fails due to lack of > > memory and the process is terminated > > without any chance to raise a WS full or even remain in apl. > > > > Best Regards, > > J?rgen > > > > > > On 12/28/21 6:52 PM, Blake McBride wrote: > >> Since the Linux "free" command knows the difference > >> between how much real memory vs. virtual memory is > >> available, it may be useful to use a similar method. > >> > >> Blake > >> > >> > >> On Tue, Dec 28, 2021 at 11:34 AM Dr. J?rgen Sauermann > >> <mail@j?rgen-sauermann.de > >> <mailto:mail@j%C3%BCrgen-sauermann.de>> wrote: > >> > >> Hi Russ, > >> > >> it has turned out to be very difficult to find a > >> reliable way to figure the memory that is available for > >> a process like the GNU APL interpreter. One (of > >> several) reasons for that is that GNU linux tells you > >> that it has more memory than it really has (so-called > >> over-commitment). You can malloc() more memory > >> than you have and malloc will return a virtual memory > >> (and no error) when you ask for it. However, if you > >> access the virtual memory at a later point in time > >> (i.e. requesting real memory for it) then the process > >> may crash badly if that real memory is not available. > >> > >> GNU APL deals with this by assuming (by default) that > >> the total memory available for the GNU APL process is > >> about 2 GB, > >> > >> You can increase that amount (outside apl) be setting > >> a larger memory limit, probably with: > >> > >> *ulimit --virtual-memory-size*or *ulimit -v > >> *(depending on platform). > >> > >> in the shell or script before starting apl. However, > >> note that: > >> > >> * *ulimit -v* *ulimited* will not work because this > >> is the default and GNU APL will apply its default of > >> 2 GB in this case, > >> * the WS FULL behaviour of GNU APL (ans ?WA) will > >> become unreliable. You must ensure that GNU APL will get > >> ? as much memory as you promised it with ulimit, and > >> * all GNU APL cells have the same size (e.g. 24 byte > >> on a 64-bit CPU, see *apl -l37*) even if the cells > >> contain only > >> ?Booleans. > >> > >> The workaround for really large Boolean arrays is to > >> pack them into the 64-bits of a GNU APL integer > >> and maybe use the bitwise functions (??, ??, ...) of > >> GNU APL to access them group-wise. > >> > >> Best Regards, > >> J?rgen > >> > >> > >> On 12/28/21 3:53 AM, Russtopia wrote: > >>> Hi, doing some experiments in learning APL I was > >>> writing a word frequency count program that takes in > >>> a document, identifies unique words and then outputs > >>> the top 'N' occurring words. > >>> > >>> The most straightforward solution, to me, seems to > >>> be ?.? which works up to a certain dataset size. The > >>> main limiting statement in my program is > >>> > >>> wordcounts?+? (wl ?.? uniqwords) > >>> > >>> .. which generates a large boolean array which is > >>> then tallied up for each unique word. > >>> > >>> I seem to run into a limit in GNU APL. I do not see > >>> an obvious ?SYL parameter to increase the limit and > >>> could not find any obvious reference in the docs > >>> either. What are the absolute max rows/columns of a > >>> matrix, and can the limit be increased? Are they > >>> separate or a combined limit? > >>> > >>> ? ? ? 5 wcOuterProd 'corpus/135-0-5000.txt'? ? ?? > >>> 5000-line document > >>> Time: 26419 ms > >>> ? the ? of ? a and ?to > >>> ?2646 1348 978 879 858 > >>> ? ? ? ?wl > >>> 36564 > >>> ? ? ? ? uniqwords > >>> 5695 > >>> > >>> ? ? ? 5 wcOuterProd 'corpus/135-0-7500.txt'? ??? > >>> 7500-line document > >>> WS FULL+ > >>> wcOuterProd[8] ?wordcounts?+?(wl?.?uniqwords) > >>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ^ ? ? ? ? ? ^ > >>> ? ? ? ? wl > >>> 58666 > >>> ? ? ? ? uniqwords > >>> 7711 > >>> > >>> > >>> I have an iterative solution which doesn't use a > >>> boolean matrix to count the words, rather looping > >>> through using pick/take and so can handle much > >>> larger documents, but it takes roughy 2x the > >>> execution time. > >>> > >>> Relating to this, does GNU APL optimize boolean > >>> arrays to minimize storage (ie., using larger bit > >>> vectors rather than entire ints per bool) and is > >>> there any clever technique other experience APLers > >>> could suggest to maintain the elegant 'loop-free' > >>> style of computing but avoid generating such large > >>> bool matrices? I thought of perhaps a hybrid > >>> approach where I iterate through portions of the > >>> data and do partial ?.? passes but of course that > >>> complicates the algorithm. > >>> > >>> [my 'outer product' and 'iterative' versions of the > >>> code are below] > >>> > >>> Thanks, > >>> -Russ > >>> > >>> --- > >>> #!/usr/local/bin/apl --script > >>> > ????????????????????????????????????????????????????????????????????? > >>> ? ?? > >>> ? wordcount.apl ? ? ?2021-12-26 ?20:07:07 (GMT-8) ?? > >>> ? ?? > >>> > ????????????????????????????????????????????????????????????????????? > >>> > >>> ? function edif has ufun1 pointer 0! > >>> > >>> ?r ? timeMs; t > >>> ? t ? ?TS > >>> ? r ? > >>> (((t[3]?86400)+(t[4]?3600)+(t[5]?60)+(t[6]))?1000)+t[7] > >>> ? > >>> > >>> ?r ? lowerAndStrip s;stripped;mixedCase > >>> ?stripped ? ' > >>> abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz*' > >>> ?mixedCase ? ?av[11],' > >>> > ,.?!;:"''()[]-ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz' > >>> ?r ? stripped[mixedCase ? s] > >>> ? > >>> > >>> ?c ? h wcIterative fname > >>> ? ??;D;WL;idx;len;word;wc;wcl;idx > >>> ? ?? Return ?-sorted count of unique words in string > >>> vector D, ignoring case and punctuation > >>> ? ?? @param h(?) - how many top word counts to return > >>> ? ?? @param D(?) - vector of words > >>> ? ???? > >>> ? D ? lowerAndStrip (?fio['read_file'] fname) ?? raw > >>> text with newlines > >>> ? timeStart ? timeMs > >>> ? D ? (~ D ? ' ') ? D ? make into a vector of words > >>> ? WL ? ?D > >>> ? ??#DEBUG# ? ? 'unique words:',WL > >>> ? wcl ? 0?0 > >>> ? idx ? 1 > >>> ? len ? ?WL > >>> count: > >>> ? ??#DEBUG# ? ? idx > >>> ? ?(idx>len)/done > >>> ? word ? ?idx?WL > >>> ? ??#DEBUG# ? ? word > >>> ? wc ? +/(word??D) > >>> ? wcl ? wcl,wc > >>> ? ??#DEBUG# ? ? wcl > >>> ? idx ? 1+idx > >>> ? ? count > >>> done: > >>> ? c ? h?[2] (WL)[?wcl],[0.5]wcl[?wcl] > >>> ? timeEnd ? timeMs > >>> ? ? ? 'Time:',(timeEnd-timeStart),'ms' > >>> ? > >>> > >>> ?r ? n wcOuterProd fname > >>> ? ?? ;D;wl;uniqwords;wordcounts;sortOrder > >>> ? D ? lowerAndStrip (?fio['read_file'] fname) ?? raw > >>> text with newlines > >>> ? timeStart ? timeMs > >>> ? wl ? (~ D ? ' ') ? D > >>> ? ??#DEBUG# ? ? '? wl:', ? wl > >>> ? uniqwords ? ?wl > >>> ? ??#DEBUG# ? ? '? uniqwords:', ? uniqwords > >>> ? wordcounts ? +?(wl ?.? uniqwords) > >>> ? sortOrder ? ?wordcounts > >>> ? r ? n?[2] > >>> uniqwords[sortOrder],[0.5]wordcounts[sortOrder] > >>> ? timeEnd ? timeMs > >>> ? ? ? 'Time:',(timeEnd-timeStart),'ms' > >>> ? > >>> > >>> > >> > > > > > > > ------------------------------ > > Subject: Digest Footer > > _______________________________________________ > Bug-apl mailing list > [email protected] > https://lists.gnu.org/mailman/listinfo/bug-apl > > > ------------------------------ > > End of Bug-apl Digest, Vol 99, Issue 13 > *************************************** >
