Re: Absolute limits of rank 2 bool matrix size in GNU APL?

Hans-Peter Sorge Tue, 28 Dec 2021 11:53:39 -0800

Hi Rus,

It looks like the outer product is not needed: you already have the unique words, and along the way you get the word count too.


you take the sorted word vector
swv ←'aa' 'bb' 'bb' 'cc' 'cc' 'ff' 'gg'

then you create a partition vector from it
pv←+\1,~2≡/swv
pv

1 2 2 3 3 4 5

partition for wc
     pv⊂pv
1  2 2  3 3  4  5

Then the wc is
wc←∊⍴¨ pv ⊂ pv
     wc
1 2 2 1 1

and the unique words are
uw←1⊃¨ pv ⊂ swv
     uw
aa bb cc ff gg


finally the listing of occurrences
     ⊃uw,¨wc
aa 1
bb 2
cc 2
ff 1
gg 1
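
The steps above can be collected into one function. This is only a sketch: the name wcPartition and its locals are made up for illustration, and it assumes the argument is already sorted, as swv is above.

∇r ← wcPartition swv;pv;wc;uw
  ⍝ swv: word vector, assumed to be sorted already
  pv ← +\1,~2≡/swv    ⍝ group number; goes up by 1 wherever neighbours differ
  wc ← ∊⍴¨pv⊂pv       ⍝ size of each partition = count of each word
  uw ← 1⊃¨pv⊂swv      ⍝ first word of each partition = the unique words
  r ← ⊃uw,¨wc         ⍝ one row per unique word, count appended
∇

     wcPartition swv

Called with the swv from above, this should give the same listing of occurrences. No outer product is formed, so storage grows with the number of words rather than with (words × unique words).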


Best Regards
Hans-Peter

On 28.12.21 at 03:53, Russtopia wrote:

Hi, doing some experiments in learning APL, I was writing a word frequency count program that takes in a document, identifies unique words and then outputs the top 'N' occurring words.
The most straightforward solution, to me, seems to be ∘.≡ which works up to a certain dataset size. The main limiting statement in my program is
wordcounts←+⌿ (wl ∘.≡ uniqwords)
.. which generates a large boolean array which is then tallied up for each unique word.
I seem to run into a limit in GNU APL. I do not see an obvious ⎕SYL parameter to increase the limit and could not find any obvious reference in the docs either. What are the absolute max rows/columns of a matrix, and can the limit be increased? Are they separate or a combined limit?
5 wcOuterProd 'corpus/135-0-5000.txt'    ⍝⍝ 5000-line document
Time: 26419 ms
  the   of   a and  to
 2646 1348 978 879 858
      ⍴wl
36564
      ⍴ uniqwords
5695

5 wcOuterProd 'corpus/135-0-7500.txt'   ⍝⍝ 7500-line document
WS FULL+
wcOuterProd[8]  wordcounts←+⌿(wl∘.≡uniqwords)
                              ^             ^
      ⍴ wl
58666
      ⍴ uniqwords
7711
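
For scale, the number of elements in wl ∘.≡ uniqwords is the product of the two lengths shown above. This is plain arithmetic only; how many bytes GNU APL spends per element is not assumed here:

      36564×5695     ⍝ 5000-line document, which worked
      58666×7711     ⍝ 7500-line document, which hit WS FULL

That is roughly 208 million elements in the first case and roughly 452 million in the second.
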
I have an iterative solution which doesn't use a boolean matrix to count the words, rather looping through using pick/take, and so can handle much larger documents, but it takes roughly 2x the execution time.
Relating to this, does GNU APL optimize boolean arrays to minimize storage (i.e., using larger bit vectors rather than entire ints per bool), and is there any clever technique other experienced APLers could suggest to maintain the elegant 'loop-free' style of computing but avoid generating such large bool matrices? I thought of perhaps a hybrid approach where I iterate through portions of the data and do partial ∘.≡ passes, but of course that complicates the algorithm.
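
A minimal sketch of that hybrid idea, under stated assumptions: the name wcChunked, its argument layout (chunk size on the left, the pair wl uniqwords on the right) and its locals are all hypothetical, and it has not been timed against either version of the code below.

∇wc ← size wcChunked wu;wl;uw;lo;hi
  ⍝ size: number of words compared per pass (hypothetical tuning knob)
  ⍝ wu:   2-item nested vector holding the word list and the unique words
  wl ← 1⊃wu
  uw ← 2⊃wu
  wc ← (⍴uw)⍴0                ⍝ running totals, one per unique word
  lo ← 1
next: →(lo>⍴wl)/0             ⍝ leave once every word has been counted
  hi ← (⍴wl)⌊lo+size-1        ⍝ last index of this chunk
  wc ← wc + +⌿(wl[lo+(⍳1+hi-lo)-1] ∘.≡ uw)
  lo ← hi+1
  →next
∇

      wordcounts ← 1000 wcChunked (wl uniqwords)

Each pass builds a matrix of at most size × ⍴uniqwords elements, so the peak size is set by the chunk size instead of by the length of wl. The price is the explicit loop.
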
[my 'outer product' and 'iterative' versions of the code are below]

Thanks,
-Russ

---
#!/usr/local/bin/apl --script
 ⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝
⍝          ⍝
⍝ wordcount.apl                        2021-12-26  20:07:07 (GMT-8)  ⍝
⍝          ⍝
 ⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝

⍝ function edif has ufun1 pointer 0!

∇r ← timeMs; t
  t ← ⎕TS
  r ← (((t[3]×86400)+(t[4]×3600)+(t[5]×60)+(t[6]))×1000)+t[7]
∇

∇r ← lowerAndStrip s;stripped;mixedCase
  ⍝ 14 leading blanks: newline and the 13 punctuation characters all map to blank
  stripped  ← '              abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz*'
  mixedCase ← ⎕av[11],',.?!;:"''()[]-ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'
 r ← stripped[mixedCase ⍳ s]
∇

∇c ← h wcIterative fname
  ⍝⍝;D;WL;idx;len;word;wc;wcl;idx
  ⍝⍝ Return ⍒-sorted count of unique words in string vector D, ignoring case and punctuation
  ⍝⍝ @param h(⍺) - how many top word counts to return
  ⍝⍝ @param D(⍵) - vector of words
  ⍝⍝⍝⍝
  D ← lowerAndStrip (⎕fio['read_file'] fname)  ⍝ raw text with newlines
  timeStart ← timeMs
  D ← (~ D ∊ ' ') ⊂ D ⍝ make into a vector of words
  WL ← ∪D
  ⍝⍝#DEBUG# ⎕ ← 'unique words:',WL
  wcl ← 0⍴0
  idx ← 1
  len ← ⍴WL
count:
  ⍝⍝#DEBUG# ⎕ ← idx
  →(idx>len)/done
  word ← ⊂idx⊃WL
  ⍝⍝#DEBUG# ⎕ ← word
  wc ← +/(word≡¨D)
  wcl ← wcl,wc
  ⍝⍝#DEBUG# ⎕ ← wcl
  idx ← 1+idx
  → count
done:
  c ← h↑[2] (WL)[⍒wcl],[0.5]wcl[⍒wcl]
  timeEnd ← timeMs
  ⎕ ← 'Time:',(timeEnd-timeStart),'ms'
∇

∇r ← n wcOuterProd fname
  ⍝⍝ ;D;wl;uniqwords;wordcounts;sortOrder
  D ← lowerAndStrip (⎕fio['read_file'] fname)  ⍝ raw text with newlines
  timeStart ← timeMs
  wl ← (~ D ∊ ' ') ⊂ D
  ⍝⍝#DEBUG# ⎕ ← '⍴ wl:', ⍴ wl
  uniqwords ← ∪wl
  ⍝⍝#DEBUG# ⎕ ← '⍴ uniqwords:', ⍴ uniqwords
  wordcounts ← +⌿(wl ∘.≡ uniqwords)
  sortOrder ← ⍒wordcounts
  r ← n↑[2] uniqwords[sortOrder],[0.5]wordcounts[sortOrder]
  timeEnd ← timeMs
  ⎕ ← 'Time:',(timeEnd-timeStart),'ms'
∇
