Hi Ala'a,

How about replacing the last line with this? It runs in about 1 minute on my machine:
desc 39 2⍴(⍪u),≢¨⊂⍨x[⍋x←u⍳w]

Jay.

On 11 September 2016 at 19:23, Ala'a Mohammad <amal...@gmail.com> wrote:
> Just an update as a reference: I'm now able to parse the big.txt file
> (without WS FULL or a killed process), but it takes around 2 hours and
> 20 minutes, +-10 minutes (around 1M words, of which 30K are unique). The
> process reaches 1GiB (after parsing the words), and tops that with
> 100MiB during the sequential 'Each' (thus a max of 1.1GiB).
>
> The only change is scanning each unique word against the whole words
> vector.
>
> Below is the code with a sample timed run.
>
> Regards,
>
> Ala'a
>
> ⍝ fhist.apl
> a ← 'abcdefghijklmnopqrstuvwxyz' ◊ A ← 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
> downcase ← { (a,⎕AV)[(A,⎕AV)⍳⍵] }
> nl ← ⎕UCS 13 ◊ cr ← ⎕UCS 10 ◊ tab ← ⎕UCS 9
> nonalpha ← nl, cr, tab, ' 0123456789()[]!?%$,.:;/+*=<>-_#"`~@&'
> alphamask ← { ~ ⍵ ∊ nonalpha }
> words ← { (alphamask ⍵) ⊂ downcase ⍵ }
> desc ← {⍵[⍒⍵[;2];]}
> ftxt ← { ⎕FIO[26] ⍵ }
>
> file ← '/misc/big.txt' ⍝ ~ 6.2M
> ⎕ ← ⍴w ← words ftxt file
> ⎕ ← ⍴u ← ∪w
> desc 39 2 ⍴ (⍪u),{+/(⊂⍵)∘.≡w}¨u
> )OFF
>
> : time apl -s -f fhist.apl
> 1098281
> 30377
> the 80003
> of 40025
> to 28760
> in 22048
> for 6936
> by 6736
> be 6154
> or 5349
> all 4141
> this 4058
> are 3627
> other 1488
> before 1363
> should 1297
> over 1282
> your 1276
> any 1204
> our 1065
> holmes 450
> country 417
> world 355
> project 286
> gutenberg 262
> laws 233
> sir 176
> series 128
> sure 123
> sherlock 101
> ebook 85
> copyright 69
> changing 44
> check 38
> arthur 30
> adventures 17
> redistributing 7
> header 7
> doyle 5
> downloading 5
> conan 4
>
> apl -s -f fhist.apl 8901.96s user 5.78s system 99% cpu 2:28:38.61 total
>
> On Sat, Sep 10, 2016 at 12:02 PM, Ala'a Mohammad <amal...@gmail.com> wrote:
> > Thanks to all for the input,
> >
> > Replacing Find and Each OR with Match helped, now I'm parsing a 159K
> > (~1545 lines) text file (a sample chunk from the big.txt).
> >
> > The strange thing for me that I'm trying to understand is that the APL
> > process (when fed the 159K text file) starts allocating memory until it
> > reaches 2.7GiB, then after printing the result settles down to 50MiB.
> > Why do I need 2.7GiB? Is there any memory utility (i.e. a garbage
> > collection utility) which can be used to mitigate this issue?
> >
> > Here is the updated code:
> >
> > a ← 'abcdefghijklmnopqrstuvwxyz'
> > A ← 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
> > downcase ← { (a,⎕AV)[(A,⎕AV)⍳⍵] }
> > nl ← ⎕UCS 13 ◊ cr ← ⎕UCS 10 ◊ tab ← ⎕UCS 9
> > nonalpha ← nl, cr, tab, ' 0123456789()[]!?%$,.:;/+*=<>-_#"`~@&'
> > alphamask ← { ~ ⍵ ∊ nonalpha }
> > words ← { (alphamask ⍵) ⊂ downcase ⍵ }
> > hist ← { (⍪∪⍵),+/(∪⍵)∘.≡⍵ } ⍝ as suggested by Kacper
> > desc ← {⍵[⍒⍵[;2];]}
> > ftxt ← { ⎕FIO[26] ⍵ }
> > fhist ← { hist words ftxt ⍵ }
> >
> > file ← '/misc/llaa' ⍝ llaa contains 1546 text lines
> > ⎕ ← ⍴w ← words ftxt file
> > ⎕ ← ⍴u ← ∪w
> > desc 39 2 ⍴ fhist file
> >
> > And here is a sample run:
> > : apl -s -f fhist.apl
> > 30186
> > 4155
> > the 1560
> > to 804
> > of 781
> > in 493
> > for 219
> > be 173
> > holmes 164
> > your 132
> > this 114
> > all 99
> > by 97
> > are 97
> > or 73
> > other 56
> > over 51
> > our 48
> > should 47
> > before 43
> > sherlock 39
> > any 35
> > sir 26
> > sure 13
> > country 9
> > project 6
> > gutenberg 6
> > ebook 5
> > adventures 5
> > world 5
> > arthur 4
> > conan 4
> > doyle 4
> > series 2
> > copyright 2
> > laws 2
> > check 2
> > header 2
> > changing 1
> > downloading 1
> > redistributing 1
> >
> > Also attached is the sample input file.
> >
> > Regards,
> >
> > On Sat, Sep 10, 2016 at 9:20 AM, Kacper Gutowski <mwgam...@gmail.com> wrote:
> >> On 9 September 2016 at 23:39, Ala'a Mohammad wrote:
> >>> the errors happened inside the 'hist' function, and I presume mostly due
> >>> to the jot dot find (if I understand correctly, operating on a matrix of
> >>> length equal to: unique-length * words-length)
> >>
> >> Try (∪⍵)∘.≡⍵ instead of ∨/¨(∪⍵)∘.⍷⍵.
> >>
> >> -k
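
A small worked example of the one-liner suggested at the top of this message, as a sketch only. It assumes GNU APL with the default ⎕IO←1, and the six-word vector is made up for illustration; it is not taken from big.txt. The names s and c are introduced here just to show the intermediate steps.

```apl
⍝ Hypothetical sample data standing in for the parsed words of big.txt
w ← 'the' 'cat' 'the' 'dog' 'cat' 'the'
u ← ∪w            ⍝ unique words in order of first appearance: the cat dog
x ← u⍳w           ⍝ index of each word within u: 1 2 1 3 2 1
s ← x[⍋x]         ⍝ the same indices sorted ascending: 1 1 1 2 2 3
c ← ≢¨⊂⍨s         ⍝ s⊂s starts a new group at each increase; tally each: 3 2 1
(⍪u),c            ⍝ 3-row, 2-column table pairing each unique word with its count
```

The counting step here is a single sort and partition rather than one ∘.≡ scan of w per unique word. How fast u⍳w itself runs depends on the interpreter's implementation.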