Hi,

it works on my machine and is much faster than the previous approach.

/// Jürgen


On 09/13/2016 05:25 PM, Jay Foad wrote:
This looks like you are applying desc to an array that does not have rank 2. I don't see how that can happen if you entered this exactly, since the argument of desc must have shape 39 2:

desc 39 2⍴(⍪u),≢¨⊂⍨x[⍋x←u⍳w]

Jay.
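For readers less fluent in APL, a rough Python transliteration of Jay's one-liner (all names here are mine, not from the original): map each word to the index of its first occurrence, sort the indices, and take run lengths of the equal runs, then sort descending by count.

```python
from itertools import groupby

def word_histogram(words):
    u = list(dict.fromkeys(words))                      # ∪w: uniques in first-seen order
    index = {w: i for i, w in enumerate(u)}             # index[w] plays the role of u⍳w
    x = sorted(index[w] for w in words)                 # x[⍋x←u⍳w]: sorted indices
    counts = [len(list(g)) for _, g in groupby(x)]      # ≢¨⊂⍨: run lengths of equal indices
    return sorted(zip(u, counts), key=lambda p: -p[1])  # desc: descending by count

print(word_histogram(['the', 'cat', 'the', 'the', 'cat', 'sat']))
# → [('the', 3), ('cat', 2), ('sat', 1)]
```

Because the counting is done by one sort instead of one pass per unique word, the work is O(n log n) rather than O(u × n), which is presumably why it finishes in about a minute.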

On 12 September 2016 at 18:34, Ala'a Mohammad <amal...@gmail.com> wrote:
Thanks for the alternative. I tried to run it, but got a RANK ERROR:

RANK ERROR
λ1[1]  λ←⍵[⍒⍵[;2];]
            ^    ^

How can I help debug this?

Regards,

Ala'a

On Mon, Sep 12, 2016 at 5:32 PM, Jay Foad <jay.f...@gmail.com> wrote:
> Hi Ala'a,
>
> How about replacing the last line with this? It runs in about 1 minute on my
> machine:
>
> desc 39 2⍴(⍪u),≢¨⊂⍨x[⍋x←u⍳w]
>
> Jay.
>
> On 11 September 2016 at 19:23, Ala'a Mohammad <amal...@gmail.com> wrote:
>>
>> Just an update for reference: I'm now able to parse the big.txt file
>> (no WS FULL and no killed process), but it takes around 2 hours and
>> 20 minutes, ±10 minutes (around 1M words, of which 30K are unique).
>> The process reaches 1GiB (after parsing the words) and adds another
>> 100MiB during the sequential 'each' (so a peak of 1.1GiB).
>>
>> The only change is scanning each unique word against the whole words
>> vector.
>>
>> Below is the code with a sample timed run.
>>
>> Regards,
>>
>> Ala'a
>>
>> ⍝ fhist.apl
>> a ← 'abcdefghijklmnopqrstuvwxyz' ◊ A ← 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
>> downcase ← { (a,⎕AV)[(A,⎕AV)⍳⍵] }
>> cr ← ⎕UCS 13 ◊ nl ← ⎕UCS 10 ◊ tab ← ⎕UCS 9 ⍝ 13 is CR, 10 is LF
>> nonalpha ← nl, cr, tab, ' 0123456789()[]!?%$,.:;/+*=<>-_#"`~@&'
>> alphamask ← { ~ ⍵ ∊ nonalpha }
>> words ← { (alphamask ⍵) ⊂ downcase ⍵ }
>> desc ← {⍵[⍒⍵[;2];]}
>> ftxt ← { ⎕FIO[26] ⍵ }
>>
>> file ← '/misc/big.txt' ⍝ ~ 6.2M
>> ⎕ ← ⍴w ← words ftxt file
>> ⎕ ← ⍴u ← ∪w
>> desc 39 2 ⍴ (⍪u),{+/(⊂⍵)∘.≡w}¨u
>> )OFF
>>
>> : time apl -s -f fhist.apl
>> 1098281
>> 30377
>>  the            80003
>>  of             40025
>>  to             28760
>>  in             22048
>>  for             6936
>>  by              6736
>>  be              6154
>>  or              5349
>>  all             4141
>>  this            4058
>>  are             3627
>>  other           1488
>>  before          1363
>>  should          1297
>>  over            1282
>>  your            1276
>>  any             1204
>>  our             1065
>>  holmes           450
>>  country          417
>>  world            355
>>  project          286
>>  gutenberg        262
>>  laws             233
>>  sir              176
>>  series           128
>>  sure             123
>>  sherlock         101
>>  ebook             85
>>  copyright         69
>>  changing          44
>>  check             38
>>  arthur            30
>>  adventures        17
>>  redistributing     7
>>  header             7
>>  doyle              5
>>  downloading        5
>>  conan              4
>>
>> apl -s -f fhist.apl  8901.96s user 5.78s system 99% cpu 2:28:38.61 total
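The wall time is consistent with the cost of that per-unique-word scan: {+/(⊂⍵)∘.≡w}¨u makes one full equality pass over all ~1.1M words for each of the ~30K uniques. A hypothetical Python sketch of the same shape of work (names are mine):

```python
def slow_counts(uniques, words):
    # {+/(⊂⍵)∘.≡w}¨u : for each unique word, one full equality pass over all words
    return [sum(w == uw for w in words) for uw in uniques]

print(slow_counts(['the', 'cat'], ['the', 'cat', 'the']))   # [2, 1]

# At the reported sizes that is roughly 30377 × 1098281 comparisons:
print(30377 * 1098281)   # ≈ 3.3 × 10^10
```

Tens of billions of string comparisons is a plausible match for a multi-hour run; the sort-based version avoids almost all of them.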
>>
>> On Sat, Sep 10, 2016 at 12:02 PM, Ala'a Mohammad <amal...@gmail.com>
>> wrote:
>> > Thanks to all for the input,
>> >
>> > Replacing find (⍷) and each-OR with match (≡) helped; I can now
>> > parse a 159K (~1545-line) text file (a sample chunk of big.txt).
>> >
>> > The strange thing I'm trying to understand is that the APL
>> > process (when fed the 159K text file) keeps allocating memory until
>> > it reaches 2.7GiB, then, after printing the result, settles down to
>> > 50MiB. Why does it need 2.7GiB? Is there a memory utility (e.g. a
>> > garbage-collection facility) that could mitigate this?
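For what it's worth, the peak is roughly what the intermediate outer-product matrix of (∪⍵)∘.≡⍵ would cost before +/ collapses it. The ~20 bytes per cell figure below is my guess at GNU APL's unpacked per-value overhead (it does not bit-pack Booleans), not a measured number:

```python
# Back-of-the-envelope for the 2.7GiB peak, using the two shapes
# printed by the run above (4155 uniques, 30186 words):
uniques, words = 4155, 30186
cells = uniques * words            # cells in the (∪⍵)∘.≡⍵ outer-product matrix
print(cells)                       # 125422830
print(cells * 20 / 2**30)          # ~2.3 GiB at an assumed ~20 bytes per cell
```

That would explain why the memory is released once the result is printed: the big matrix is a temporary, not a leak.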
>> >
>> > Here is the updated code:
>> >
>> > a ← 'abcdefghijklmnopqrstuvwxyz'
>> > A ← 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
>> > downcase ← { (a,⎕AV)[(A,⎕AV)⍳⍵] }
>> > cr ← ⎕UCS 13 ◊ nl ← ⎕UCS 10 ◊ tab ← ⎕UCS 9 ⍝ 13 is CR, 10 is LF
>> > nonalpha ← nl, cr, tab, ' 0123456789()[]!?%$,.:;/+*=<>-_#"`~@&'
>> > alphamask ← { ~ ⍵ ∊ nonalpha }
>> > words ← { (alphamask ⍵) ⊂ downcase ⍵ }
>> > hist ← { (⍪∪⍵),+/(∪⍵)∘.≡⍵ } ⍝ as suggested by Kacper
>> > desc ← {⍵[⍒⍵[;2];]}
>> > ftxt ← { ⎕FIO[26] ⍵ }
>> > fhist ← { hist words ftxt ⍵ }
>> >
>> > file ← '/misc/llaa' ⍝ llaa contains 1546 text lines
>> > ⎕ ← ⍴w ← words ftxt file
>> > ⎕ ← ⍴u ← ∪w
>> > desc 39 2 ⍴ fhist file
>> >
>> > And here is a sample run
>> > : apl -s -f fhist.apl
>> > 30186
>> > 4155
>> >  the            1560
>> >  to              804
>> >  of              781
>> >  in              493
>> >  for             219
>> >  be              173
>> >  holmes          164
>> >  your            132
>> >  this            114
>> >  all              99
>> >  by               97
>> >  are              97
>> >  or               73
>> >  other            56
>> >  over             51
>> >  our              48
>> >  should           47
>> >  before           43
>> >  sherlock         39
>> >  any              35
>> >  sir              26
>> >  sure             13
>> >  country           9
>> >  project           6
>> >  gutenberg         6
>> >  ebook             5
>> >  adventures        5
>> >  world             5
>> >  arthur            4
>> >  conan             4
>> >  doyle             4
>> >  series            2
>> >  copyright         2
>> >  laws              2
>> >  check             2
>> >  header            2
>> >  changing          1
>> >  downloading       1
>> >  redistributing    1
>> >
>> > I've also attached the sample input file.
>> >
>> > Regards,
>> >
>> > On Sat, Sep 10, 2016 at 9:20 AM, Kacper Gutowski <mwgam...@gmail.com>
>> > wrote:
>> >> On 9 September 2016 at 23:39, Ala'a Mohammad wrote:
>> >>> the errors happened inside the 'hist' function, presumably mostly
>> >>> due to the jot-dot find (if I understand correctly, it builds a
>> >>> matrix of size unique-length × words-length)
>> >>
>> >> Try (∪⍵)∘.≡⍵ instead of ∨/¨(∪⍵)∘.⍷⍵.
>> >>
>> >> -k
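For readers following along without APL: the difference between the two expressions is substring search versus exact match. A hypothetical Python analogue (names mine):

```python
words = ['the', 'there', 'the']
u = ['the']

# ∨/¨(∪⍵)∘.⍷⍵ — find: does the unique word occur anywhere *inside* each word?
find_counts = [sum(uw in w for w in words) for uw in u]

# (∪⍵)∘.≡⍵ — match: is the unique word *exactly equal to* each word?
match_counts = [sum(uw == w for w in words) for uw in u]

print(find_counts)    # [3]  'the' also occurs inside 'there'
print(match_counts)   # [2]
```

Besides being a much cheaper comparison, match is also what a word histogram actually wants: find would over-count words that occur inside other words.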
>>
>

