Re: efficiently building a large list

2012-06-03 Thread Joe Bogner
Thank you. Very helpful.

I was confused where the value was actually being stored. I was thinking
that in my example it was being stored in the cell in the index. Then, I
couldn't figure out how it retained it's value after I cleared out the
index.

Turns out that it actually stored it back on the original symbol that held
the customer number string.

I'm fairly clear now on it. Definitely an interesting deep dive


On Sun, Jun 3, 2012 at 4:23 AM, Alexander Burger wrote:

> On Sat, Jun 02, 2012 at 06:10:05PM -0400, Joe Bogner wrote:
> > To be more clear, this is the call pattern that I'm referring to:
> >
> > : (for X (idx 'A) (set (car (idx 'A X)) 0)) (Sum)
> > ...
> > As you can see, clearing it before calling Sum gives the correct results.
>
> OK, so this makes sense (also the assumption in my last mail about the
> idx values). You clear the values of the payload items.
>
>
> Note that 'X' in the above loop already contains the payload items
> (symbols or cells). So looking them up again is not necessary. You can
> write
>
>   (for X (idx 'A) (set X 0))
>
>
> A second note: (idx 'A) collects all items in the _whole_ index tree
> into a list. This makes sense only for small trees, of course, and is
> not used often in production code.
>
> Cheers,
> - Alex
> --
> UNSUBSCRIBE: mailto:picolisp@software-lab.de?subject=Unsubscribe
>


Re: efficiently building a large list

2012-06-03 Thread Alexander Burger
On Sat, Jun 02, 2012 at 06:10:05PM -0400, Joe Bogner wrote:
> To be more clear, this is the call pattern that I'm referring to:
> 
> : (for X (idx 'A) (set (car (idx 'A X)) 0)) (Sum)
> ...
> As you can see, clearing it before calling Sum gives the correct results.

OK, so this makes sense (also the assumption in my last mail about the
idx values). You clear the values of the payload items.


Note that 'X' in the above loop already contains the payload items
(symbols or cells). So looking them up again is not necessary. You can
write

   (for X (idx 'A) (set X 0))


A second note: (idx 'A) collects all items in the _whole_ index tree
into a list. This makes sense only for small trees, of course, and is
not used often in production code.

Cheers,
- Alex
-- 
UNSUBSCRIBE: mailto:picolisp@software-lab.de?subject=Unsubscribe


Re: efficiently building a large list

2012-06-03 Thread Alexander Burger
Hi Joe,

the core of the question is here what you actually stored in the index.

Perhaps it helps if I try to explain the 'idx' structure.

Each node in an 'idx' tree consists either of one (if it is a leaf node)
or two (if it is an internal node) cells. The first cell holds the
payload data in the CAR, and a pointer to the second cell (if any) in
the CDR. The second cell holds pointers to the left and right subtrees.

The 'idx' function treats the payload data as a key. It is up to the
application what to do with that, like storing data there directly, or
storing a (typically transient) symbol as a _key_, while the value of
that symbol holds a _value_.


> (de Sum ()
>   (zero Amount)
>   (bench
> (for This SL
>   (let (Key (: CustNum) Amt (: Amount) Idx (idx 'A Key))

In general, 'idx' returns the first cell of a node (which effectively
means that it returns the whole subtree).

The question is what the cell returned by (idx 'A Key) in your case
contains. The line

> (ifn (num? (val (car Idx))) (set (car Idx) 0))

makes me believe that the cell's CAR holds a 'var' (symbol or cell),
whose value is checked for a number: (num? (val (car Idx)))

The same here

> (set (car Idx) (+ (val (car Idx)) Amt)) ) ) )

The value of the 'var' in the CAR is taken, and the incremented value
stored there. Note that you can write that a bit simpler

  (inc (car Idx) Amt)


So the above assumes that the payload is a 'var' (symbol or cell). The
statement

>   (sum '((X) (car X)) (idx 'A)) )

retrieves all payload items (symbols or cells), takes their values
('car' in this case, so it seems to be cells), and sums them up.

This looks all right. Note that the function wrapper is not needed,
as ((X) (foo X)) is a tautology

   (sum car (idx 'A))

(is "tautology" the right term?)


> I don't know exactly how to phrase the question. I'm storing the total in
> the val of the cell (I think). I would have thought it was in the val of

>From the above, you don't store it in the val (i.e. the CAR) of the tree
nodes, but in the cells contained therein. This also looks correct.


> Here's a simple example that I used to understand the concept:
> 
> : (setq Z "abc")
> -> "abc"
> : (val Z)
> -> "abc"

Are you aware that you evaluated twice here? (val 'Z) gives "abc", and
then you applied 'val' to it, giving "abc" again because transient
symbols point to themselves initially.

> : (set Z 0)
> -> 0

Now you have set the value of the symbol "abc" to zero.
You could test that with

: "abc"
-> 0

> : (val Z)
> -> 0
> : (set Z (+ (val Z) 1))
> -> 1

or simply: (inc Z)

> : (val Z)
> -> 1
> : Z
> -> "abc"

OK

Concerning the above 'idx' troubles, I suggest that you build a small
'idx' tree, and look at it interactively, perhaps also with (edit 'Idx).

Cheers,
- Alex
-- 
UNSUBSCRIBE: mailto:picolisp@software-lab.de?subject=Unsubscribe


Re: efficiently building a large list

2012-06-02 Thread Henrik Sarvell
Here's what I think, note I can't exude the same certainty as Alex,
you have to wait for his comment to get the whole picture.

See this cop paste of my terminal:

: (de Func () (setq L 5))
-> Func
: (Func)
-> 5
: (println L)
5
-> 5

Note how the L variable is set within the function and still holds its
value outside of the function.

This line: (set (car Idx) (+ (val (car Idx)) Amt)) ) ) ) creates
variables in much the same way as my example above.

I don't know what (car Idx) contains in your case but you need to be
careful here, in my earlier example from the vizreader code we got in
effect variables inside symbols, ie:

: (setq "Var" "Hello")
-> "Hello"
: (println "Var")
"Hello"
-> "Hello"

These things are transient, they cease to exist outside of the file
currently being executed, plus they won't replace important stuff in
the PL environment. In your case it would not have mattered though,
but I'm just saying, be careful if you set using non-symbols as names.

In your case, however, you could have used () to clear things out
if you had used transients as containers instead:

: (setq "Var" 123)
-> 123
: (setq S 123)
-> 123
: ()
-> ()
: (println "Var")
"Var"
-> "Var"
: (println S)
123
-> 123

However, be careful with , because as you can see from the reference:

: (setq S "abc")   # Read "abc"
-> "abc"
: (== S "abc") # Read again, get the same symbol
-> T
: ()   # Close scope
-> NIL
: (== S "abc") # Read again, get another symbol
-> NIL

ALL transients disappear.

To sum it up, in our cases the idx is simply pointing to variables
defined elsewhere, clearing out the idx won't clear out what it was
pointing at...




On Sun, Jun 3, 2012 at 5:10 AM, Joe Bogner  wrote:
> To be more clear, this is the call pattern that I'm referring to:
>
> : (for X (idx 'A) (set (car (idx 'A X)) 0)) (Sum)
> 4.492 sec
> -> 495029119
> : (for X (idx 'A) (set (car (idx 'A X)) 0)) (Sum)
> 4.451 sec
> -> 495029119
> : (for X (idx 'A) (set (car (idx 'A X)) 0)) (Sum)
> 4.519 sec
> -> 495029119
> :
> As you can see, clearing it before calling Sum gives the correct results.
>
> On Sat, Jun 2, 2012 at 6:08 PM, Joe Bogner  wrote:
>>
>> Hi Henrik -
>>
>> Thanks for sharing. I used your approach and it ran quickly after I built
>> the index using balance.
>>
>> (bench (setq  SL (by '((X) (get X 'CustNum)) sort L))) T)
>> (bench (setq SLC (mapcar '((This) (: CustNum)) SL)) T)
>> (off A) (bench (balance 'A SLC T))
>>
>> I'm stumped one piece. If I run the below code multiple times then my
>> total increases
>>
>> : (Sum)
>> 4.466 sec
>> -> 495029119
>> : (Sum)
>> 4.497 sec
>> -> 990058238
>> : (Sum)
>> 4.507 sec
>> -> 1485087357
>>
>> (de Sum ()
>>   (zero Amount)
>>   (bench
>>     (for This SL
>>       (let (Key (: CustNum) Amt (: Amount) Idx (idx 'A Key))
>>         (setq Amt (if Amt Amt 0))
>>         (inc 'Amount Amt) #check figure to make sure it sums up
>>
>>         # the val of the cell is by default a customer number, set it to
>> be 0 if it's non-numeric
>>         (ifn (num? (val (car Idx))) (set (car Idx) 0))
>>
>>         (set (car Idx) (+ (val (car Idx)) Amt)) ) ) )
>>   (sum '((X) (car X)) (idx 'A)) )
>>
>>
>> I don't know exactly how to phrase the question. I'm storing the total in
>> the val of the cell (I think). I would have thought it was in the val of the
>> cell stored in the index. However, if I
>>
>> (off A) (bench (balance 'A SLA T))
>>
>> , it still duplicates.
>>
>> If I run this first, it clears it out: (for X (idx 'A) (set (car (idx 'A
>> X)) 0))
>>
>> Where is the value being stored such that I need to set each value of the
>> cell to 0 regardless of rebuilding  the index?
>>
>>
>> Here's a simple example that I used to understand the concept:
>>
>> : (setq Z "abc")
>> -> "abc"
>> : (val Z)
>> -> "abc"
>> : (set Z 0)
>> -> 0
>> : (val Z)
>> -> 0
>> : (set Z (+ (val Z) 1))
>> -> 1
>> : (val Z)
>> -> 1
>> : Z
>> -> "abc"
>>
>> Like your example, I think I'm storing the number in the val of the symbol
>> (cell).
>>
>> I apologize for the long winded question
>>
>> Thanks
>> Joe
>>
>>
>>
>>
>> On Fri, Jun 1, 2012 at 1:38 AM, Henrik Sarvell  wrote:
>>>
>>> I noticed you were talking about idx.
>>>
>>> The below code is from vizreader and was part of a system that counted
>>> and stored all the non-common words in every article:
>>>
>>> # We extract all words from the article without special characters and
>>> count them
>>> (dm words> (L)
>>>   (let Words NIL
>>>      (for W L
>>>         (and
>>>            (setq W (lowc (pack W)))
>>>            (not (common?> This W))
>>>            (if (idx 'Words W T)
>>>               (inc (car @))
>>>               (set W 1
>>>      (idx 'Words)))
>>>
>>> It is using idx and summing up the occurrences of each word and turned
>>> out to be the fastest way of solving that problem anyway, maybe it's
>>> helpful to you.
>>>
>>>
>>>
>>>
>>> On Fri, Jun 1, 2012 at 10:33 AM, Joe Bogner  wrote:
>>> > Thanks T

Re: efficiently building a large list

2012-06-02 Thread Joe Bogner
To be more clear, this is the call pattern that I'm referring to:

: (for X (idx 'A) (set (car (idx 'A X)) 0)) (Sum)
4.492 sec
-> 495029119
: (for X (idx 'A) (set (car (idx 'A X)) 0)) (Sum)
4.451 sec
-> 495029119
: (for X (idx 'A) (set (car (idx 'A X)) 0)) (Sum)
4.519 sec
-> 495029119
:
As you can see, clearing it before calling Sum gives the correct results.

On Sat, Jun 2, 2012 at 6:08 PM, Joe Bogner  wrote:

> Hi Henrik -
>
> Thanks for sharing. I used your approach and it ran quickly after I built
> the index using balance.
>
> (bench (setq  SL (by '((X) (get X 'CustNum)) sort L))) T)
> (bench (setq SLC (mapcar '((This) (: CustNum)) SL)) T)
> (off A) (bench (balance 'A SLC T))
>
> I'm stumped one piece. If I run the below code multiple times then my
> total increases
>
> : (Sum)
> 4.466 sec
> -> 495029119
> : (Sum)
> 4.497 sec
> -> 990058238
> : (Sum)
> 4.507 sec
> -> 1485087357
>
> (de Sum ()
>   (zero Amount)
>   (bench
> (for This SL
>   (let (Key (: CustNum) Amt (: Amount) Idx (idx 'A Key))
> (setq Amt (if Amt Amt 0))
> (inc 'Amount Amt) #check figure to make sure it sums up
>
> # the val of the cell is by default a customer number, set it to
> be 0 if it's non-numeric
> (ifn (num? (val (car Idx))) (set (car Idx) 0))
>
> (set (car Idx) (+ (val (car Idx)) Amt)) ) ) )
>   (sum '((X) (car X)) (idx 'A)) )
>
>
> I don't know exactly how to phrase the question. I'm storing the total in
> the val of the cell (I think). I would have thought it was in the val of
> the cell stored in the index. However, if I
>
> (off A) (bench (balance 'A SLA T))
>
> , it still duplicates.
>
> If I run this first, it clears it out: (for X (idx 'A) (set (car (idx 'A
> X)) 0))
>
> Where is the value being stored such that I need to set each value of the
> cell to 0 regardless of rebuilding  the index?
>
>
> Here's a simple example that I used to understand the concept:
>
> : (setq Z "abc")
> -> "abc"
> : (val Z)
> -> "abc"
> : (set Z 0)
> -> 0
> : (val Z)
> -> 0
> : (set Z (+ (val Z) 1))
> -> 1
> : (val Z)
> -> 1
> : Z
> -> "abc"
>
> Like your example, I think I'm storing the number in the val of the symbol
> (cell).
>
> I apologize for the long winded question
>
> Thanks
> Joe
>
>
>
>
> On Fri, Jun 1, 2012 at 1:38 AM, Henrik Sarvell  wrote:
>
>> I noticed you were talking about idx.
>>
>> The below code is from vizreader and was part of a system that counted
>> and stored all the non-common words in every article:
>>
>> # We extract all words from the article without special characters and
>> count them
>> (dm words> (L)
>>   (let Words NIL
>>  (for W L
>> (and
>>(setq W (lowc (pack W)))
>>(not (common?> This W))
>>(if (idx 'Words W T)
>>   (inc (car @))
>>   (set W 1
>>  (idx 'Words)))
>>
>> It is using idx and summing up the occurrences of each word and turned
>> out to be the fastest way of solving that problem anyway, maybe it's
>> helpful to you.
>>
>>
>>
>>
>> On Fri, Jun 1, 2012 at 10:33 AM, Joe Bogner  wrote:
>> > Thanks Tomas, I've started using nil now.
>> >
>> >  This is what I came up with to aggregate the data. It actually runs
>> > reasonably well. I'm sharing because I always enjoy reading other
>> people's
>> > picoLisp code so I figure others may as well.
>> >
>> > My source file has 4 million rows
>> >
>> > : (bench (pivot L 'CustNum))
>> > 35.226 sec
>> >
>> > # outputs 31,000 rows.
>> >
>> > My approach is to load it in as follows:
>> >
>> > (class +Invoice)
>> > (rel CustNum (+String))
>> > (rel ProdNum (+String))
>> > (rel Amount (+Number))
>> > (rel Quantity (+Number))
>> >
>> > (de Load ()
>> >   (zero N)
>> >   (setq L (make (
>> >   (in "invoices.txt"
>> > (until (eof)
>> >   (setq Line (line) )
>> >   (setq D (mapcar pack (split Line "^I")))
>> >   (link (new
>> > '(+Invoice)
>> > 'CustNum (car (nth D 1))
>> > 'ProdNum (car (nth D 2))
>> > 'Amount (format (car (nth D 3)))
>> > 'Quantity (format (car (nth D 4))) )) ) ) ) ) ) T )
>> >
>> >
>> > I can probably clean this up.  I tinkered around with various
>> approaches and
>> > this was the best I could come up with in a few hours. At first I was
>> using
>> > something like the group from lib.l but found it to be too slow. I
>> think it
>> > was due to the fact that I optimize for a sorted list instead of
>> scanning
>> > for a match in the made list
>> >
>> > (de sortedGroup (List Fld)
>> >   (make
>> > (let (Last NIL LastSym NIL)
>> >  (for This List
>> >   (let Key (get This Fld)
>> > (if (<> Last Key)
>> > (prog
>> > (if LastSym (link LastSym))
>> > (off LastSym)
>> > (push 'LastSym Key)) )
>> >  (push 'LastSym This)
>> >  (setq Last Key) ) )
>> >  (link LastSym)) ) )
>> >
>> > And here's the piece that ties it all together:
>> >
>> > (de pivot (L Fld)
>> >   (let (SL (by '((

Re: efficiently building a large list

2012-06-02 Thread Joe Bogner
Hi Henrik -

Thanks for sharing. I used your approach and it ran quickly after I built
the index using balance.

(bench (setq  SL (by '((X) (get X 'CustNum)) sort L))) T)
(bench (setq SLC (mapcar '((This) (: CustNum)) SL)) T)
(off A) (bench (balance 'A SLC T))

I'm stumped one piece. If I run the below code multiple times then my total
increases

: (Sum)
4.466 sec
-> 495029119
: (Sum)
4.497 sec
-> 990058238
: (Sum)
4.507 sec
-> 1485087357

(de Sum ()
  (zero Amount)
  (bench
(for This SL
  (let (Key (: CustNum) Amt (: Amount) Idx (idx 'A Key))
(setq Amt (if Amt Amt 0))
(inc 'Amount Amt) #check figure to make sure it sums up

# the val of the cell is by default a customer number, set it to be
0 if it's non-numeric
(ifn (num? (val (car Idx))) (set (car Idx) 0))

(set (car Idx) (+ (val (car Idx)) Amt)) ) ) )
  (sum '((X) (car X)) (idx 'A)) )


I don't know exactly how to phrase the question. I'm storing the total in
the val of the cell (I think). I would have thought it was in the val of
the cell stored in the index. However, if I

(off A) (bench (balance 'A SLA T))

, it still duplicates.

If I run this first, it clears it out: (for X (idx 'A) (set (car (idx 'A
X)) 0))

Where is the value being stored such that I need to set each value of the
cell to 0 regardless of rebuilding  the index?


Here's a simple example that I used to understand the concept:

: (setq Z "abc")
-> "abc"
: (val Z)
-> "abc"
: (set Z 0)
-> 0
: (val Z)
-> 0
: (set Z (+ (val Z) 1))
-> 1
: (val Z)
-> 1
: Z
-> "abc"

Like your example, I think I'm storing the number in the val of the symbol
(cell).

I apologize for the long winded question

Thanks
Joe




On Fri, Jun 1, 2012 at 1:38 AM, Henrik Sarvell  wrote:

> I noticed you were talking about idx.
>
> The below code is from vizreader and was part of a system that counted
> and stored all the non-common words in every article:
>
> # We extract all words from the article without special characters and
> count them
> (dm words> (L)
>   (let Words NIL
>  (for W L
> (and
>(setq W (lowc (pack W)))
>(not (common?> This W))
>(if (idx 'Words W T)
>   (inc (car @))
>   (set W 1
>  (idx 'Words)))
>
> It is using idx and summing up the occurrences of each word and turned
> out to be the fastest way of solving that problem anyway, maybe it's
> helpful to you.
>
>
>
>
> On Fri, Jun 1, 2012 at 10:33 AM, Joe Bogner  wrote:
> > Thanks Tomas, I've started using nil now.
> >
> >  This is what I came up with to aggregate the data. It actually runs
> > reasonably well. I'm sharing because I always enjoy reading other
> people's
> > picoLisp code so I figure others may as well.
> >
> > My source file has 4 million rows
> >
> > : (bench (pivot L 'CustNum))
> > 35.226 sec
> >
> > # outputs 31,000 rows.
> >
> > My approach is to load it in as follows:
> >
> > (class +Invoice)
> > (rel CustNum (+String))
> > (rel ProdNum (+String))
> > (rel Amount (+Number))
> > (rel Quantity (+Number))
> >
> > (de Load ()
> >   (zero N)
> >   (setq L (make (
> >   (in "invoices.txt"
> > (until (eof)
> >   (setq Line (line) )
> >   (setq D (mapcar pack (split Line "^I")))
> >   (link (new
> > '(+Invoice)
> > 'CustNum (car (nth D 1))
> > 'ProdNum (car (nth D 2))
> > 'Amount (format (car (nth D 3)))
> > 'Quantity (format (car (nth D 4))) )) ) ) ) ) ) T )
> >
> >
> > I can probably clean this up.  I tinkered around with various approaches
> and
> > this was the best I could come up with in a few hours. At first I was
> using
> > something like the group from lib.l but found it to be too slow. I think
> it
> > was due to the fact that I optimize for a sorted list instead of scanning
> > for a match in the made list
> >
> > (de sortedGroup (List Fld)
> >   (make
> > (let (Last NIL LastSym NIL)
> >  (for This List
> >   (let Key (get This Fld)
> > (if (<> Last Key)
> > (prog
> > (if LastSym (link LastSym))
> > (off LastSym)
> > (push 'LastSym Key)) )
> >  (push 'LastSym This)
> >  (setq Last Key) ) )
> >  (link LastSym)) ) )
> >
> > And here's the piece that ties it all together:
> >
> > (de pivot (L Fld)
> >   (let (SL (by '((X) (get X Fld)) sort L) SG (sortedGroup SL Fld))
> > (out "pivot.txt"
> >   (for X SG
> > (let (Amt 0)
> >   (mapc '((This) (inc 'Amt (: Amount))) (cdr (reverse X)))
> >   (setq Key (get (car X) Fld))
> >   (prinl Key "^I" Amt) ) ) ) ) )
> >
> >
> > (Load)
> >
> > : (bench (pivot L 'CustNum))
> > 35.226 sec
> >
> > : (bench (pivot L 'ProdNum))
> > 40.945 sec
> >
> > It seems the best performance was by sorting, then splitting and then
> > summing the individual parts. It also makes for a nice report.
> >
> > Sidenote: At first I thought I was getting better performance by using a
> > modified version of

Re: efficiently building a large list

2012-06-01 Thread Alexander Burger
Hi Joe,

>  This is what I came up with to aggregate the data. It actually runs
> reasonably well. I'm sharing because I always enjoy reading other people's
> picoLisp code so I figure others may as well.

Yes, thanks!

I can't delve into the code's logic at the moment, so just let me make
some isolated comments:


> My approach is to load it in as follows:
> 
> (class +Invoice)
> (rel CustNum (+String))
> (rel ProdNum (+String))
> (rel Amount (+Number))
> (rel Quantity (+Number))

This is correct, but for a practical application it is not enough. All
four relations don't have an index, so the '+Invoice' cannot be found,
and they will be destroyed by the DB garbage collector (in case that
(dbgc) is called).

Also, if you have proper indexes, you might well be able to do the
operations below more efficiently with database operations, with the
additional benefit of persistence.


> (de Load ()
>   (zero N)

'N' seems not to be used.

>   (setq L (make (
>   (in "invoices.txt"

The open paren after 'make' is wrong. It causes the return value of the
'in' expression to be executed as a function. This may cause all kinds
of unpredictable things to happen.

Also, 'L' is a free variable then, possibly destroying an 'L' in the
calling environment. Better use something like

   (let L
  (make
 (in "invoices.txt"
...


> (until (eof)
>   (setq Line (line) )
>   (setq D (mapcar pack (split Line "^I")))

Same here. Better do

  (until (eof)
 (let (Line (line)  D (mapcar pack (split Line "^I")))
...

or, even better because 'Line' is not needed

  (until (eof)
 (let D (mapcar pack (split (line) "^I"))
...


>   (link (new
> '(+Invoice)
> 'CustNum (car (nth D 1))
> 'ProdNum (car (nth D 2))
> 'Amount (format (car (nth D 3)))
> 'Quantity (format (car (nth D 4))) )) ) ) ) ) ) T )

Instead of (car (nth D 1)) etc. you can use (get D 1) or simply 'car'.



> (de sortedGroup (List Fld)
>   (make
> (let (Last NIL LastSym NIL)
>  (for This List
>   (let Key (get This Fld)
> (if (<> Last Key)
> (prog
> (if LastSym (link LastSym))
> (off LastSym)
> (push 'LastSym Key)) )

As a minor improvement: An 'if' without and 'else' can always be
replaced with 'when', and then also the 'prog' is not needed

  (when (<> Last Key)
 (and LastSym (link @))
 (off LastSym)
 (push 'LastSym Key) )

The 'off' followed by 'push' is a bit redundant. You could use

   (setq LastSym (cons Key))

instead.

Just a few cents ...

Cheers,
- Alex
-- 
UNSUBSCRIBE: mailto:picolisp@software-lab.de?subject=Unsubscribe


Re: efficiently building a large list

2012-05-31 Thread Alexander Burger
On Thu, May 31, 2012 at 01:38:41PM -0400, Joe Bogner wrote:
> Calculating a SUM is very quick..
> 
> ! (bench (let Amt 0 (mapc '((This) (inc 'Amt (: Amount))) List)))
> 1.152 sec

Minor optimization: You could also use 'sum' for that:

   (sum '((This) (: Amount)) List)



> Sidebar: Is there a way to disable the interactive session from printing
> the return of a statement? For example, if I do a (setq ABC L) where L is a
> million items, I'd prefer the option of not having all million items print
> on my console. I've worked around this by wrapping it in a prog and
> returning NIL. Is there an easier way?

As Tomas wrote in his answer to this thread, you could use 'nil' or 't'

   : (nil (longResult))

I usually prefer

   : (longResult) T

This works because the input line may hold more than one expression, and
only the value of the last expression is printed in the REPL. I use 'T'
because it is the shortest to type.

Usually you can also do something useful, like printing the result you
are interested in, and which is produced in a side effect:

   : (longResult) Count

   : (longResult) (show (someObject))



> As a simple test to get a list of unique customers
> 
> ! (bench (mapc '((X) (idx 'UniqueCustomers X T)) CustomerNumbers))
> 163.309 sec
> 
> Am I doing something wrong with idx?

It depends a lot on the structure of the input list 'CustomerNumbers'.

I think your approach is all right if 'CustomerNumbers' is very
long, but holds the numbers in random order. Then you might also
use

   (uniq L)


The 'idx' reference says: "If all elements are inserted in sorted order,
the tree degenerates into a linear list."

I suspect that CustomerNumbers are sorted (or nearly sorted), and
contain many distinct values. Is that the case?

'idx' is a rather low-level tool, giving responsibility for
optimizations to the programmer. If you know that the input list is
already sorted (or at least very un-random), you could try 'balance'

   (balance 'UniqueCustomers CustomerNumbers)



> I'm wondering if accu runs better
> because I remember reading that picoLisp internally uses hashes for symbols
> and properties and I think accu is using a property of the symbol to store
> the values.

PicoLisp hashes symbol names, but this doesn't matter here. The names
hashes are used only for 'read'ing. Properties are not hashed in any
way, they are searched linearly in the property list of a symbol.

'accu' uses a simple association list. It is only fast if the number of
keys is not too high.

Cheers,
- Alex
-- 
UNSUBSCRIBE: mailto:picolisp@software-lab.de?subject=Unsubscribe


Re: efficiently building a large list

2012-05-31 Thread Tomas Hlavaty
Hi Joe,

> (de Load ()
>   (zero N)
>   (setq L (make (
>   (in "invoices.txt"
> (until (eof)
>   (setq Line (line) )
>   (setq D (mapcar pack (split Line "^I")))
>   (link (new
> '(+Invoice)
> 'CustNum (car (nth D 1))
> 'ProdNum (car (nth D 2))
> 'Amount (format (car (nth D 3)))
> 'Quantity (format (car (nth D 4))) )) ) ) ) ) ) T )

just a stylistic comment, your indenting is rather confusing.  I would
recommend sticking with the PicoLisp convention when writing PicoLisp
code; I guess it's described or discussed somewhere, or look at existing
PicoLisp code.

Cheers,

Tomas
-- 
UNSUBSCRIBE: mailto:picolisp@software-lab.de?subject=Unsubscribe


Re: efficiently building a large list

2012-05-31 Thread Henrik Sarvell
I noticed you were talking about idx.

The below code is from vizreader and was part of a system that counted
and stored all the non-common words in every article:

# We extract all words from the article without special characters and
count them
(dm words> (L)
   (let Words NIL
  (for W L
 (and
(setq W (lowc (pack W)))
(not (common?> This W))
(if (idx 'Words W T)
   (inc (car @))
   (set W 1
  (idx 'Words)))

It is using idx and summing up the occurrences of each word and turned
out to be the fastest way of solving that problem anyway, maybe it's
helpful to you.




On Fri, Jun 1, 2012 at 10:33 AM, Joe Bogner  wrote:
> Thanks Tomas, I've started using nil now.
>
>  This is what I came up with to aggregate the data. It actually runs
> reasonably well. I'm sharing because I always enjoy reading other people's
> picoLisp code so I figure others may as well.
>
> My source file has 4 million rows
>
> : (bench (pivot L 'CustNum))
> 35.226 sec
>
> # outputs 31,000 rows.
>
> My approach is to load it in as follows:
>
> (class +Invoice)
> (rel CustNum (+String))
> (rel ProdNum (+String))
> (rel Amount (+Number))
> (rel Quantity (+Number))
>
> (de Load ()
>   (zero N)
>   (setq L (make (
>   (in "invoices.txt"
>     (until (eof)
>       (setq Line (line) )
>       (setq D (mapcar pack (split Line "^I")))
>       (link (new
>         '(+Invoice)
>         'CustNum (car (nth D 1))
>         'ProdNum (car (nth D 2))
>         'Amount (format (car (nth D 3)))
>         'Quantity (format (car (nth D 4))) )) ) ) ) ) ) T )
>
>
> I can probably clean this up.  I tinkered around with various approaches and
> this was the best I could come up with in a few hours. At first I was using
> something like the group from lib.l but found it to be too slow. I think it
> was due to the fact that I optimize for a sorted list instead of scanning
> for a match in the made list
>
> (de sortedGroup (List Fld)
>   (make
>     (let (Last NIL LastSym NIL)
>      (for This List
>       (let Key (get This Fld)
>         (if (<> Last Key)
>             (prog
>             (if LastSym (link LastSym))
>             (off LastSym)
>             (push 'LastSym Key)) )
>          (push 'LastSym This)
>          (setq Last Key) ) )
>          (link LastSym)) ) )
>
> And here's the piece that ties it all together:
>
> (de pivot (L Fld)
>   (let (SL (by '((X) (get X Fld)) sort L) SG (sortedGroup SL Fld))
>     (out "pivot.txt"
>       (for X SG
>         (let (Amt 0)
>           (mapc '((This) (inc 'Amt (: Amount))) (cdr (reverse X)))
>           (setq Key (get (car X) Fld))
>           (prinl Key "^I" Amt) ) ) ) ) )
>
>
> (Load)
>
> : (bench (pivot L 'CustNum))
> 35.226 sec
>
> : (bench (pivot L 'ProdNum))
> 40.945 sec
>
> It seems the best performance was by sorting, then splitting and then
> summing the individual parts. It also makes for a nice report.
>
> Sidenote: At first I thought I was getting better performance by using a
> modified version of quicksort off rosetta code, but then I switched it to
> the built-in sort and saw considerably better speed.
>
> Thanks for the help everyone
>
> On Thu, May 31, 2012 at 3:37 PM, Tomas Hlavaty  wrote:
>>
>> Hi Joe,
>>
>> > Sidebar: Is there a way to disable the interactive session from
>> > printing the return of a statement? For example, if I do a (setq ABC
>> > L) where L is a million items, I'd prefer the option of not having all
>> > million items print on my console. I've worked around this by wrapping
>> > it in a prog and returning NIL. Is there an easier way?
>>
>> you could also use http://software-lab.de/doc/refN.html#nil or
>> http://software-lab.de/doc/refT.html#t
>>
>> Cheers,
>>
>> Tomas
>> --
>> UNSUBSCRIBE: mailto:picolisp@software-lab.de?subject=Unsubscribe
>
>
--
UNSUBSCRIBE: mailto:picolisp@software-lab.de?subject=Unsubscribe


Re: efficiently building a large list

2012-05-31 Thread Joe Bogner
Thanks Tomas, I've started using nil now.

 This is what I came up with to aggregate the data. It actually runs
reasonably well. I'm sharing because I always enjoy reading other people's
picoLisp code so I figure others may as well.

My source file has 4 million rows

: (bench (pivot L 'CustNum))
35.226 sec

# outputs 31,000 rows.

My approach is to load it in as follows:

(class +Invoice)
(rel CustNum (+String))
(rel ProdNum (+String))
(rel Amount (+Number))
(rel Quantity (+Number))

(de Load ()
  (zero N)
  (setq L (make (
  (in "invoices.txt"
(until (eof)
  (setq Line (line) )
  (setq D (mapcar pack (split Line "^I")))
  (link (new
'(+Invoice)
'CustNum (car (nth D 1))
'ProdNum (car (nth D 2))
'Amount (format (car (nth D 3)))
'Quantity (format (car (nth D 4))) )) ) ) ) ) ) T )


I can probably clean this up.  I tinkered around with various approaches
and this was the best I could come up with in a few hours. At first I was
using something like the group from lib.l but found it to be too slow. I
think it was due to the fact that I optimize for a sorted list instead of
scanning for a match in the made list

(de sortedGroup (List Fld)
  (make
(let (Last NIL LastSym NIL)
 (for This List
  (let Key (get This Fld)
(if (<> Last Key)
(prog
(if LastSym (link LastSym))
(off LastSym)
(push 'LastSym Key)) )
 (push 'LastSym This)
 (setq Last Key) ) )
 (link LastSym)) ) )

And here's the piece that ties it all together:

(de pivot (L Fld)
  (let (SL (by '((X) (get X Fld)) sort L) SG (sortedGroup SL Fld))
(out "pivot.txt"
  (for X SG
(let (Amt 0)
  (mapc '((This) (inc 'Amt (: Amount))) (cdr (reverse X)))
  (setq Key (get (car X) Fld))
  (prinl Key "^I" Amt) ) ) ) ) )


(Load)

: (bench (pivot L 'CustNum))
35.226 sec

: (bench (pivot L 'ProdNum))
40.945 sec

It seems the best performance was by sorting, then splitting and then
summing the individual parts. It also makes for a nice report.

Sidenote: At first I thought I was getting better performance by using a
modified version of quicksort off rosetta code, but then I switched it to
the built-in sort and saw considerably better speed.

Thanks for the help everyone

On Thu, May 31, 2012 at 3:37 PM, Tomas Hlavaty  wrote:

> Hi Joe,
>
> > Sidebar: Is there a way to disable the interactive session from
> > printing the return of a statement? For example, if I do a (setq ABC
> > L) where L is a million items, I'd prefer the option of not having all
> > million items print on my console. I've worked around this by wrapping
> > it in a prog and returning NIL. Is there an easier way?
>
> you could also use http://software-lab.de/doc/refN.html#nil or
> http://software-lab.de/doc/refT.html#t
>
> Cheers,
>
> Tomas
> --
> UNSUBSCRIBE: mailto:picolisp@software-lab.de?subject=Unsubscribe
>


Re: efficiently building a large list

2012-05-31 Thread Tomas Hlavaty
Hi Joe,

> Sidebar: Is there a way to disable the interactive session from
> printing the return of a statement? For example, if I do a (setq ABC
> L) where L is a million items, I'd prefer the option of not having all
> million items print on my console. I've worked around this by wrapping
> it in a prog and returning NIL. Is there an easier way?

you could also use http://software-lab.de/doc/refN.html#nil or
http://software-lab.de/doc/refT.html#t

Cheers,

Tomas
-- 
UNSUBSCRIBE: mailto:picolisp@software-lab.de?subject=Unsubscribe


Re: efficiently building a large list

2012-05-31 Thread Joe Bogner
Nice! (gc) was the trick.

I noticed lags still when doing a (gc 300) so I bumped it up to (gc 800)
 and it loaded 4M records in 34 seconds

I switched my code to use a (class +Invoice) and do everything in memory.

Calculating a SUM is very quick..

! (bench (let Amt 0 (mapc '((This) (inc 'Amt (: Amount))) List)))
1.152 sec

This is about the same amount of time as SQL server

I also benchmarked against R and Julia. Julia choked at around 2GB of ram
and then just spun and spun. R loaded the file in about the same amount of
time and was able to do a SUM very quickly. It choked when I was doing an
aggregate by customer number.

Sidebar: Is there a way to disable the interactive session from printing
the return of a statement? For example, if I do a (setq ABC L) where L is a
million items, I'd prefer the option of not having all million items print
on my console. I've worked around this by wrapping it in a prog and
returning NIL. Is there an easier way?

The next thing I'm working on is aggregating sums by customer. So far, (by)
and (group) have been too slow. I was pleasantly surprised to see accu run
well

! (bench (mapc '((X) (accu 'Sum X 1)) CustomerNumbers))
13.137 sec

I was expecting that I could use idx as an alternative, but I can't seem to
get it to come back in a reasonable time

As a simple test to get a list of unique customers

! (bench (mapc '((X) (idx 'UniqueCustomers X T)) CustomerNumbers))
163.309 sec

Am I doing something wrong with idx? I'm wondering if accu runs better
because I remember reading that picoLisp internally uses hashes for symbols
and properties and I think accu is using a property of the symbol to store
the values.

Thanks again

Joe





On Thu, May 31, 2012 at 12:16 PM, Alexander Burger wrote:

> > using, e.g. (gc 250) instead, to pre-allocate memory. This avoids that
> > the garbage collector runs again and again.
>
> BTW, 'proc' is very useful here to check the process size:
>
> $ pil +
> : (proc 'pil)
>  PID  PPID  STARTED  SIZE %CPU WCHAN  CMD
> 25575  1831 18:14:55  1512  0.3 -  /usr/bin/picolisp
> /usr/lib/picolisp/lib.l
> -> T
>
> : (gc 250)
> -> 250
>
> : (proc 'pil)
>  PID  PPID  STARTED  SIZE %CPU WCHAN  CMD
> 25575  1831 18:14:55 258512 2.8 -  /usr/bin/picolisp
> /usr/lib/picolisp/lib.l
> -> T
> :
>
> Cheers,
> - Alex
> --
> UNSUBSCRIBE: mailto:picolisp@software-lab.de?subject=Unsubscribe
>


Re: efficiently building a large list

2012-05-31 Thread Alexander Burger
> using, e.g. (gc 250) instead, to pre-allocate memory. This avoids that
> the garbage collector runs again and again.

BTW, 'proc' is very useful here to check the process size:

$ pil +
: (proc 'pil)
  PID  PPID  STARTED  SIZE %CPU WCHAN  CMD
25575  1831 18:14:55  1512  0.3 -  /usr/bin/picolisp /usr/lib/picolisp/lib.l
-> T

: (gc 250)
-> 250

: (proc 'pil)
  PID  PPID  STARTED  SIZE %CPU WCHAN  CMD
25575  1831 18:14:55 258512 2.8 -  /usr/bin/picolisp /usr/lib/picolisp/lib.l
-> T
:

Cheers,
- Alex
-- 
UNSUBSCRIBE: mailto:picolisp@software-lab.de?subject=Unsubscribe


Re: efficiently building a large list

2012-05-31 Thread Alexander Burger
On Thu, May 31, 2012 at 05:35:09PM +0200, Alexander Burger wrote:
> Each line results in a list of 4 'pack'ed symbols, i.e. 5 cells plus the
> symbols with each at least 1 cell. So 2.4M rows should be at least 2.3
> GB on a 64-bit machine.

Oops! Shame!

2.4e6 times 6 cells on a 64-bit machine are 23040 bytes. That's only
230 MByte, not 2.4 GByte!


Anyway, you should first try:

> The interpreter keeps allocating more and more memory, an additional M
> on each garbage collection. You can speed that up if you run
> 
>(gc 2300)

using, e.g. (gc 250) instead, to pre-allocate memory. This avoids that
the garbage collector runs again and again.

Cheers,
- Alex
-- 
UNSUBSCRIBE: mailto:picolisp@software-lab.de?subject=Unsubscribe


Re: efficiently building a large list

2012-05-31 Thread Alexander Burger
Hi Joe,

> I'd like to do some analysis on a large text file of invoices - for
> example, grouping and summing. Is this an efficient way to build the list?
> 
> The file has 4 million rows in it. The first several hundred thousand load
> quickly and then I notice the time between my checkpoints taking longer (10
> secs+ per 10,000). I ran it for about 10 minutes and killed it and was
> around 2.4M rows

I think this is simply a memory problem.

Each line results in a list of 4 'pack'ed symbols, i.e. 5 cells plus the
symbols with each at least 1 cell. So 2.4M rows should be at least 2.3
GB on a 64-bit machine.

The interpreter keeps allocating more and more memory, an additional M
on each garbage collection. You can speed that up if you run

   (gc 2300)

in the beginning. This will pre-allocate 2.3 G if you have enough RAM.
If the available RAM is smaller, the process will start to trash pages
and slow down a lot.


In general, I think it is not wise to read so much information into a
flat list.



> I played around with other variations of building a list push1, idx, etc
> and as expected they all had a more significant time growth as the list
> grew.

Yes, your way of using 'make' and 'link' is the best.


> I'm experimenting with loading into a db right now to see if that yields
> better results.

Yes, that's better.

Cheers,
- Alex
-- 
UNSUBSCRIBE: mailto:picolisp@software-lab.de?subject=Unsubscribe


Re: efficiently building a large list

2012-05-31 Thread Henrik Sarvell
What happens if you first load the whole thing into memory, ie (let
Invoices (in ... and then work with the Invoices variable?



On Thu, May 31, 2012 at 9:56 PM, Joe Bogner  wrote:
> I'd like to do some analysis on a large text file of invoices - for example,
> grouping and summing. Is this an efficient way to build the list?
>
> The file has 4 million rows in it. The first several hundred thousand load
> quickly and then I notice the time between my checkpoints taking longer (10
> secs+ per 10,000). I ran it for about 10 minutes and killed it and was
> around 2.4M rows
>
> I played around with other variations of building a list push1, idx, etc and
> as expected they all had a more significant time growth as the list grew.
>
>   (make
>     (in "invoices.txt"
>       (until (eof)
>         (setq Line (line) )
>         (at (0 . 1) (prinl Line))
>         (link (mapcar pack (split Line "^I"))) ) ) )
>
>
> My file has:
> Customer #
> Prod #
> Amount
> Quantity
>
> I'm experimenting with loading into a db right now to see if that yields
> better results.
>
> Thanks
> Joe
--
UNSUBSCRIBE: mailto:picolisp@software-lab.de?subject=Unsubscribe