Thanks for the alternative. I tried to run it, but got a rank error:

RANK ERROR
λ1[1]  λ←⍵[⍒⍵[;2];]
       ^ ^

How can I help debug this?

Regards,

Ala'a

On Mon, Sep 12, 2016 at 5:32 PM, Jay Foad <jay.f...@gmail.com> wrote:
> Hi Ala'a,
>
> How about replacing the last line with this? It runs in about 1 minute on my
> machine:
>
> desc 39 2⍴(⍪u),≢¨⊂⍨x[⍋x←u⍳w]
>
> Jay.
>
> On 11 September 2016 at 19:23, Ala'a Mohammad <amal...@gmail.com> wrote:
>>
>> Just an update as a reference, I'm now able to parse the big.txt file
>> (without WS full or killed process), but it takes around 2 Hours and
>> 20 Minutes +-10 minutes. (around 1M words, 30K are unique). The
>> process reach 1GiB (after parsing the words), and tops that with
>> 100MiB during the sequential 'Each' (thus a max of 1.1GiB).
>>
>> The only change is scanning each unique word against the whole words
>> vector.
>>
>> Below is the code with a sample timed run.
>>
>> Regards,
>>
>> Ala'a
>>
>> ⍝ fhist.apl
>> a ← 'abcdefghijklmnopqrstuvwxyz' ◊ A ← 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
>> downcase ← { (a,⎕AV)[(A,⎕AV)⍳⍵] }
>> nl ← ⎕UCS 13 ◊ cr ← ⎕UCS 10 ◊ tab ← ⎕UCS 9
>> nonalpha ← nl, cr, tab, ' 0123456789()[]!?%$,.:;/+*=<>-_#"`~@&'
>> alphamask ← { ~ ⍵ ∊ nonalpha }
>> words ← { (alphamask ⍵) ⊂ downcase ⍵ }
>> desc ← {⍵[⍒⍵[;2];]}
>> ftxt ← { ⎕FIO[26] ⍵ }
>>
>> file ← '/misc/big.txt' ⍝ ~ 6.2M
>> ⎕ ← ⍴w ← words ftxt file
>> ⎕ ← ⍴u ← ∪w
>> desc 39 2 ⍴ (⍪u),{+/(⊂⍵)∘.≡w}¨u
>> )OFF
>>
>> : time apl -s -f fhist.apl
>> 1098281
>> 30377
>> the 80003
>> of 40025
>> to 28760
>> in 22048
>> for 6936
>> by 6736
>> be 6154
>> or 5349
>> all 4141
>> this 4058
>> are 3627
>> other 1488
>> before 1363
>> should 1297
>> over 1282
>> your 1276
>> any 1204
>> our 1065
>> holmes 450
>> country 417
>> world 355
>> project 286
>> gutenberg 262
>> laws 233
>> sir 176
>> series 128
>> sure 123
>> sherlock 101
>> ebook 85
>> copyright 69
>> changing 44
>> check 38
>> arthur 30
>> adventures 17
>> redistributing 7
>> header 7
>> doyle 5
>> downloading 5
>> conan 4
>>
>> apl -s -f fhist.apl 8901.96s user 5.78s system 99% cpu 2:28:38.61 total
>>
>> On Sat, Sep 10, 2016 at 12:02 PM, Ala'a Mohammad <amal...@gmail.com>
>> wrote:
>> > Thanks to all for the input,
>> >
>> > Replacing Find and Each OR with Match helped, now I'm parsing a 159K
>> > (~1545 lines) text file (a sample chunk from the big.txt).
>> >
>> > The strange thing for me that I'm trying to understand is that the APL
>> > process (when fed the 159K text file) start allocating memory until it
>> > reaches 2.7GiB, then after printing the result settle down to 50MiB.
>> > Why do I need 2.7GiB? is there any memory utils (i.e. Garbage
>> > collection utility) which can be used to mitigate this issue?
>> >
>> > Here is the updated code:
>> >
>> > a ← 'abcdefghijklmnopqrstuvwxyz'
>> > A ← 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
>> > downcase ← { (a,⎕AV)[(A,⎕AV)⍳⍵] }
>> > nl ← ⎕UCS 13 ◊ cr ← ⎕UCS 10 ◊ tab ← ⎕UCS 9
>> > nonalpha ← nl, cr, tab, ' 0123456789()[]!?%$,.:;/+*=<>-_#"`~@&'
>> > alphamask ← { ~ ⍵ ∊ nonalpha }
>> > words ← { (alphamask ⍵) ⊂ downcase ⍵ }
>> > hist ← { (⍪∪⍵),+/(∪⍵)∘.≡⍵ } ⍝ as suggested by Kacper
>> > desc ← {⍵[⍒⍵[;2];]}
>> > ftxt ← { ⎕FIO[26] ⍵ }
>> > fhist ← { hist words ftxt ⍵ }
>> >
>> > file ← '/misc/llaa' ⍝ llaa contains 1546 text lines
>> > ⎕ ← ⍴w ← words ftxt file
>> > ⎕ ← ⍴u ← ∪w
>> > desc 39 2 ⍴ fhist file
>> >
>> > And here is a sample run
>> > : apl -s -f fhist.apl
>> > 30186
>> > 4155
>> > the 1560
>> > to 804
>> > of 781
>> > in 493
>> > for 219
>> > be 173
>> > holmes 164
>> > your 132
>> > this 114
>> > all 99
>> > by 97
>> > are 97
>> > or 73
>> > other 56
>> > over 51
>> > our 48
>> > should 47
>> > before 43
>> > sherlock 39
>> > any 35
>> > sir 26
>> > sure 13
>> > country 9
>> > project 6
>> > gutenberg 6
>> > ebook 5
>> > adventures 5
>> > world 5
>> > arthur 4
>> > conan 4
>> > doyle 4
>> > series 2
>> > copyright 2
>> > laws 2
>> > check 2
>> > header 2
>> > changing 1
>> > downloading 1
>> > redistributing 1
>> >
>> > Also attached the sample input file
>> >
>> > Regards,
>> >
>> > On Sat, Sep 10, 2016 at 9:20 AM, Kacper Gutowski <mwgam...@gmail.com>
>> > wrote:
>> >> On 9 September 2016 at 23:39, Ala'a Mohammad wrote:
>> >>> the errors happened inside 'hist' function, and I presume mostly due
>> >>> to the jot dot find (if understand correctly, operating on a matrix of
>> >>> length equal to : unique-length * words-length)
>> >>
>> >> Try (∪⍵)∘.≡⍵ instead of ∨/¨(∪⍵)∘.⍷⍵.
>> >>
>> >> -k
>> >
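
For readers comparing the two counting strategies in this thread, here is a rough Python sketch of the idea, not a translation of anyone's APL. It assumes the quoted one-liner works as it reads: map each word to its index in the unique list (u⍳w), sort those indices (x[⍋x]), and count the length of each run (≢¨ over the partitions). The function names hist_quadratic, hist_sorted and desc_rows are hypothetical, and the dict lookup is only a stand-in for index-of; how GNU APL implements ⍳ internally is not claimed here.

```python
from itertools import groupby


def hist_quadratic(words):
    """Count each unique word by scanning the whole word list.

    Mirrors the shape of {+/(⊂⍵)∘.≡w}¨u: one full pass over the
    words per unique word, so the work grows as (unique * total).
    """
    unique = list(dict.fromkeys(words))  # unique words, first-seen order
    return [(u, sum(1 for w in words if w == u)) for u in unique]


def hist_sorted(words):
    """Count by indexing, sorting, and measuring runs of equal indices.

    Assumed reading of (⍪u),≢¨⊂⍨x[⍋x←u⍳w]: one lookup per word,
    one sort, one linear pass over the sorted indices.
    """
    unique = list(dict.fromkeys(words))
    index = {u: i for i, u in enumerate(unique)}  # stand-in for u⍳w
    x = sorted(index[w] for w in words)           # like x[⍋x]
    # Every unique index occurs at least once, so runs come out in
    # the same order as `unique`.
    return [(unique[i], len(list(run))) for i, run in groupby(x)]


def desc_rows(rows):
    """Sort (word, count) rows by count, largest first (like desc)."""
    return sorted(rows, key=lambda r: r[1], reverse=True)


if __name__ == "__main__":
    sample = "the cat and the dog and the bird".split()
    for word, count in desc_rows(hist_sorted(sample)):
        print(word, count)
```

Both functions return the same rows; only the amount of work differs, which would be consistent with the gap between the 2 hours 28 minutes reported for the Each version and the roughly 1 minute reported for the sorted version. The sketch says nothing about the cause of the RANK ERROR above.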