Hi Ala'a,

How about replacing the last line with this? It runs in about 1 minute on
my machine:

desc 39 2⍴(⍪u),≢¨⊂⍨x[⍋x←u⍳w]
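
The idea, in case it is useful: u⍳w maps every word to its index in
u; sorting those indices brings equal words together; and partitioning
the sorted vector by itself (⊂⍨) makes one group per distinct word, so
≢¨ gives the counts. A tiny sketch with made-up data:

      w ← 'the' 'cat' 'sat' 'on' 'the' 'mat'
      u ← ∪w
      x ← u⍳w            ⍝ 1 2 3 4 1 5
      x[⍋x]              ⍝ 1 1 2 3 4 5
      ≢¨⊂⍨x[⍋x]          ⍝ 2 1 1 1 1

This avoids the (≢u)×(≢w) outer product entirely, which is where
almost all of the time was going.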

Jay.

On 11 September 2016 at 19:23, Ala'a Mohammad <amal...@gmail.com> wrote:

> Just an update, for reference: I can now parse the big.txt file
> (without a WS FULL or a killed process), but it takes around 2 hours
> and 20 minutes (±10 minutes) for about 1M words, of which roughly 30K
> are unique. The process reaches 1 GiB (after parsing the words) and
> adds about 100 MiB on top of that during the sequential Each (so a
> peak of roughly 1.1 GiB).
>
> The only change is that each unique word is now scanned against the
> whole word vector.
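>
> (For scale, that is 30,377 uniques × 1,098,281 words ≈ 3.3 × 10^10
> pairwise ≡ comparisons in the outer product, which is consistent with
> the multi-hour runtime.)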
>
> Below is the code with a sample timed run.
>
> Regards,
>
> Ala'a
>
> ⍝ fhist.apl
> a ← 'abcdefghijklmnopqrstuvwxyz' ◊ A ← 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
> downcase ← { (a,⎕AV)[(A,⎕AV)⍳⍵] } ⍝ upper case maps to lower case; everything else to itself
> cr ← ⎕UCS 13 ◊ nl ← ⎕UCS 10 ◊ tab ← ⎕UCS 9
> nonalpha ← nl, cr, tab, ' 0123456789()[]!?%$,.:;/+*=<>-_#"`~@&'
> alphamask ← { ~ ⍵ ∊ nonalpha } ⍝ 1 where the character is a letter
> words ← { (alphamask ⍵) ⊂ downcase ⍵ } ⍝ partition text into lower-cased words
> desc ← {⍵[⍒⍵[;2];]} ⍝ sort rows descending by the count column
> ftxt ← { ⎕FIO[26] ⍵ } ⍝ read the whole file as a character vector
>
> file ← '/misc/big.txt' ⍝ ~ 6.2M
> ⎕ ← ⍴w ← words ftxt file
> ⎕ ← ⍴u ← ∪w
> desc 39 2 ⍴ (⍪u),{+/(⊂⍵)∘.≡w}¨u ⍝ count each unique against all of w (the slow part); keeps the first 39 uniques, then sorts
> )OFF
>
> : time apl -s -f fhist.apl
> 1098281
> 30377
>  the            80003
>  of             40025
>  to             28760
>  in             22048
>  for             6936
>  by              6736
>  be              6154
>  or              5349
>  all             4141
>  this            4058
>  are             3627
>  other           1488
>  before          1363
>  should          1297
>  over            1282
>  your            1276
>  any             1204
>  our             1065
>  holmes           450
>  country          417
>  world            355
>  project          286
>  gutenberg        262
>  laws             233
>  sir              176
>  series           128
>  sure             123
>  sherlock         101
>  ebook             85
>  copyright         69
>  changing          44
>  check             38
>  arthur            30
>  adventures        17
>  redistributing     7
>  header             7
>  doyle              5
>  downloading        5
>  conan              4
>
> apl -s -f fhist.apl  8901.96s user 5.78s system 99% cpu 2:28:38.61 total
>
> On Sat, Sep 10, 2016 at 12:02 PM, Ala'a Mohammad <amal...@gmail.com>
> wrote:
> > Thanks to all for the input,
> >
> > Replacing Find plus Each-OR (∨/¨ with ∘.⍷) with Match (∘.≡) helped;
> > I can now parse a 159K (~1545-line) text file (a sample chunk from
> > big.txt).
> >
> > The thing I am still trying to understand is that the APL process
> > (when fed the 159K text file) starts allocating memory until it
> > reaches 2.7 GiB, then settles down to 50 MiB after printing the
> > result. Why do I need 2.7 GiB? Is there a memory utility (e.g. a
> > garbage-collection facility) that could be used to mitigate this?
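> >
> > (My own guess: (∪⍵)∘.≡⍵ materialises the whole 4155×30186 boolean
> > matrix at once, about 1.25 × 10^8 items; with the intermediate
> > nested results and a per-cell overhead of a few dozen bytes, that
> > could plausibly reach a couple of GiB before being freed.)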
> >
> > Here is the updated code:
> >
> > a ← 'abcdefghijklmnopqrstuvwxyz'
> > A ← 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
> > downcase ← { (a,⎕AV)[(A,⎕AV)⍳⍵] }
> > cr ← ⎕UCS 13 ◊ nl ← ⎕UCS 10 ◊ tab ← ⎕UCS 9
> > nonalpha ← nl, cr, tab, ' 0123456789()[]!?%$,.:;/+*=<>-_#"`~@&'
> > alphamask ← { ~ ⍵ ∊ nonalpha }
> > words ← { (alphamask ⍵) ⊂ downcase ⍵ }
> > hist ← { (⍪∪⍵),+/(∪⍵)∘.≡⍵ } ⍝ as suggested by Kacper
> > desc ← {⍵[⍒⍵[;2];]}
> > ftxt ← { ⎕FIO[26] ⍵ }
> > fhist ← { hist words ftxt ⍵ }
> >
> > file ← '/misc/llaa' ⍝ llaa contains 1546 text lines
> > ⎕ ← ⍴w ← words ftxt file
> > ⎕ ← ⍴u ← ∪w
> > desc 39 2 ⍴ fhist file
> >
> > And here is a sample run
> > : apl -s -f fhist.apl
> > 30186
> > 4155
> >  the            1560
> >  to              804
> >  of              781
> >  in              493
> >  for             219
> >  be              173
> >  holmes          164
> >  your            132
> >  this            114
> >  all              99
> >  by               97
> >  are              97
> >  or               73
> >  other            56
> >  over             51
> >  our              48
> >  should           47
> >  before           43
> >  sherlock         39
> >  any              35
> >  sir              26
> >  sure             13
> >  country           9
> >  project           6
> >  gutenberg         6
> >  ebook             5
> >  adventures        5
> >  world             5
> >  arthur            4
> >  conan             4
> >  doyle             4
> >  series            2
> >  copyright         2
> >  laws              2
> >  check             2
> >  header            2
> >  changing          1
> >  downloading       1
> >  redistributing    1
> >
> > The sample input file is also attached.
> >
> > Regards,
> >
> > On Sat, Sep 10, 2016 at 9:20 AM, Kacper Gutowski <mwgam...@gmail.com>
> wrote:
> >> On 9 September 2016 at 23:39, Ala'a Mohammad wrote:
> >>> the errors happened inside the 'hist' function, presumably mostly
> >>> due to the jot-dot Find (if I understand correctly, it operates on
> >>> a matrix of size unique-count × word-count)
> >>
> >> Try (∪⍵)∘.≡⍵ instead of ∨/¨(∪⍵)∘.⍷⍵.
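> >>
> >> ⍷ searches for its left argument anywhere inside the right one, so
> >> each cell of the ∘.⍷ result is a boolean vector that still needs
> >> the ∨/¨, and it also matches one word inside another ('at'⍷'cat'
> >> reports a hit). Since the words are already split, a whole-word
> >> comparison with ≡ yields one boolean scalar per pair. A toy
> >> example, with made-up data:
> >>
> >>       u ← 'cat' 'at'
> >>       w ← 'cat' 'at' 'cat'
> >>       +/ u ∘.≡ w      ⍝ 2 1  (exact word counts)
> >>       +/ ∨/¨ u ∘.⍷ w  ⍝ 2 3  ('at' is also found inside 'cat')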
> >>
> >> -k
>
>
