Re: The Sort Problem

Damian Conway Sun, 15 Feb 2004 23:17:27 -0800

Uri persisted:

>   DC> How do you know that the values of %lookup are strings? DC> How
>   would the optimizer know?
>
> because that would be the default comparison and the extracted key value
> would be stringified unless some other marker is used. most sorts are on
> strings so this would be a useful huffman and removal of a redundancy.

Leaving aside your assertion about the commonest type of sort (which I
don't see how you can have reliable evidence about), we're not talking
about "markers"; these are *operators* that ensure that the keys extracted
are of a particular static type.


>   DC> Unary C<+> *is* the "float cast thingy"!
>
> hmm, so + is float

No. C<+> is number.


> but int is needed otherwise? int is more commonly a sort key than float.
> it just feels asymetrical with one having a symbol and the other a named
> operator.

Sorry, but that's the way the language works. The more general and usual
conversion (to number, not to integer) has the shorter name.


>   DC> If you want to force numeric comparison of keys you explicitly
>   DC> cast each key to number using unary C<+> or C<int>. If you want to
>   DC> force stringific comparison you explicitly cast the key to string
>   DC> using unary C<~>.
>
> or ~ is the default if nothing else is specified. it matches up with cmp
> being the default comparison as you agreed above.

Okay, if you want to think of it that way, fine.


>   DC> If you don't explicitly cast either way, C<sort> just DWIMs by
>   DC> looking at the actual type of the keys returned by the extractor.
>   DC> If any of them isn't a number, it defaults to C<cmp>.
>
> that doesn't work for me. the GRT can't scan all keys before it decides
> on what comparison operator is needed. the GRT needs to merge keys as
> they are extracted and it needs to do the correct conversion to a byte
> string then and not later. you can't DWIM this away if you want the
> optimization.

EXACTLY!!! So, if you want the GRT to kick in, you have to make sure the
return type of your block is appropriate. If you don't give the compiler that
hint, you don't get the optimized sorting.


>   DC> Because I wanted to show a plain old two-parameter block being
>   used as DC> a *comparator* (not a one-parameter block being used as a
>   key DC> extractor).
>
> that seems like extra redundant code just to mark a callback vs a
> extract block.

It's just an *example*, for heavens sake.

> there should be a cleaner way to do that.

There is. If it's just a callback that is already declared to take two parameters, and you want to pass $^a and $^b in that order, then you can just write:

sort &just_guess, @unsorted;

or, if it's one of many criteria, you write:

        sort [ { -M }, { .{owner} }, &just_guess,
             ] @unsorted;

> maybe a null extract key in the pair?

Unnecessary. See above.

>   >> or it could be a sort criteria list followed by 2 refs to input
>   >> records.
>
>   DC> Only if the first array ref contains nothing but Criterion
>   objects.
>
> but what defines them as criteria objects?

Their *type*. Remember this:

type KeyExtractor ::= Code(Any) returns Any;

type Comparator ::= Code(Any, Any) returns Int;

    type Criterion    ::= KeyExtractor
                        | Comparator Pair(KeyExtractor, Comparator)
                        ;

    type Criteria     ::= Criterion
                        | Array of Criterion
                        ;

multi sub *sort(Criteria $by = {$^a cmp $^b}: [EMAIL PROTECTED]) {...}

???

The criteria have to be of a very specific type.

> they look just like records which could be sorted as well.

The point is that an array is only a list C<sort> criteria if every one of its elements is a one- or two-parameter block, or a Pair of blocks. And the blocks have to have very specific signatures too.

The *only* time there's going to be any ambiguity is if you wanted to sort a list of arrays of C<sort> criteria...not a very common occurrence, I suspect. ;-)


> i don't see any syntactical markers that {} means criteria block vs a
> hash ref.

Because there aren't any. Blocks are blocks and hashes are hashes, and in Perl 6 we have a very clear set of syntactic rules that determines which is which.

> DWIM guessing isn't good enough for me here.

This *isn't* DWIM guessing. This is block parsing and static typing.

> sort in p5 already had some issues with that IIRC.

Only because Perl 5 has issues with hashes vs blocks.
Perl 6 fixes those issues.


> but the callback sub knows it has two params since it has to be written
> that way.

But it *isn't* a callback sub in my example!!! It's a call to subroutine inside a block. And the *block* is the comparator C<sort> sees. The parameters of the block are $^a and $^b and it's a two parameter block (and therefore a valid comparator) precisely *because* it has those two parameters.

Look, FORGET about that example. It was obviously a bad example. It's
caused enough pain (to me if no-one else).

Here's another example (all praise be to Rod!):

    @teams = sort [
                  # 1 parameter so it's a key extractor...
                  {+ $^team->{WinPct} },
                  # 2 parameters so it's a comparator...
                  { ($^a->{Conference} eq $^b->{Conference}
                        ? ($^b->{ConfWinPct} <=> $^a->{ConfWinPct} ||
                           $^b->{ConfWon}    <=> $^a->{ConfWon}    ||
                           $^a->{ConfLost}   <=> $^b->{ConfLost})
                    : ($^a->{Division} eq $^b->{Division}
                        ? ($^b->{DivWinPct} <=> $^a->{DivWinPct} ||
                           $^b->{DivWon}    <=> $^a->{DivWon}    ||
                           $^a->{DivLost}   <=> $^b->{DivLost})
                    : 0
                    )
                  },
                  # 1 parameter each so all three are key extractors...
                  {+ $^team->{Won}  } is descending,
                  {+ $^team->{Lost} },
                  {+ $^team->{GamesPlayed} } is descending,
              ] @teams;

> sort always calls the code block with 2 params.

No, it calls a comparator block with two *arguments*.

It recognizes the block as being a comparator in the first place precisely *because* it has two parameters.


>   DC> But you *can't* apply C<is descending> to a Code reference.
>
> then how did you apply 'is insensitive'?

I applied it to the *block* (i.e. a closure definition).
That's not the same thing as a Code reference.


> what i am saying is i think that you need to go back to the drawing
> board to find a clean syntax to mark those flags.

No, I think we have it...just put the appropriate trait on the extractor
*block* when you define it.


> i just don't like the reverse args concept at all. it is not
> semantically useful to sorting. sorts care about ascending and
> descending, not a vs b in args.

The problem is that sometimes comparator logic is suffiently complex that
a single comparator needs to have a "bit each way", as in Rod's
football team comparator above.


> but i am glad you did since it brings up this issue. i don't think we
> should allow any args there at all as they are not needed.

The *are* needed. They're the very things that tell the compiler that
the block has two parameters and may therefore be used as a comparator.

> some sort of descending marker reverses the args before the call.

As I explained above, that's not always sufficient. If you have to resort to coding a comparator at all (instead of just using a key extractor) you're almost certainly usign some criterion that *isn't* amenable to a simple descent marker. Because it if *was* amenable, you'd have used a simple key extractor instead.


> also in the GRT, descending causes special data munging so the keys will
> sort in descending order so that has to be passed on to the guts somehow.

Yes. As a trait on the key extractor.


>   DC> @sorted = sort {-M} @unsorted;
>   >> that still wants to be cached somehow as -M is expensive.
>
>   DC> It *is* cached. It's a one-parameter block. So its a key
>   extractor. So DC> it automagically caches the keys it extracts.
>
> ??? who and what caches it? it will get called again and again for the
> same record.

No it won't. That's what "caching" *means*.

> where is the definition that 1 param code blocks do caching?

They don't. I'm saying that when C<sort> *calls* a key extractor block
for a particular element, C<sort> will cache the result.

>   DC> Key extractors will always cache.
>
> so this is a key extraction feature about caching.

Yep. Perhaps I should have said:

Key extractions will always be cached.

> the ST and GRT don't need caching so that is a waste if those are used.

ST and GRT *are* caching algorithms. Each of them caches a munged key and
thereafter just uses that (i.e. rather than recomputing the key). That's
the whole *point* of them.

My point is that when a sorting algorithm calls a key extractor, it will
cache the results...in whatever way is appropriate to the algorithm's needs.

> but they can't always cache.

Their results *can* always be cached. Their results *must* always be cached. It makes no sense to call a key extractor block more than once for any particular element being sorted.


> it depends on the implementation and possibly at runtime (selecting
> orchish, ST or GRT based on some other criteria).

Yes. The form of caching is an implicit part of the algorithm C<sort> chooses.


>   DC> If you'd gone insane and particularly wanted to do it that way,
>   you'd DC> need something like:
>
>   DC>         @sorted = sort {state %M ; %M{$_} //= -M} @unsorted;
>
>   DC> to ensure the cache persisted between calls to the key extractor.
>
> and will that get cleared before each sort?

No. And that's exactly the problem. That's why you'd be insane to do the
caching yourself. That's why the caching *has* to be internal to C<sort>.


>   >> maybe the fact that the compiler knows -M returns a float can be
>   >> used to mark it internally and the explicit float isn't needed here.
>
>   DC> Exactly.
>
> ok for this case but not for data from a record.

As I said before, it depends if the record is statically typed. If not, then you either have to explicitly coerce it with C<~> or C<+> or C<int>, or you let C<sort> work it out polymorphically at run-time (and pay for that by forgoing some possible optimizations).


>   >> but data from a user record will need to be marked as float as the
>   >> compiler can't tell.
>
>   DC> It *can* tell if the elements are typed. But, yes, most of the
>   DC> time if you want to ensure numeric comparison you will explicitly
>   DC> prefix with a C<+> to give the compiler a hint. Otherwise C<sort>
>   DC> will have to fall back on looking at the keys that are extracted
>   DC> rand working out at run-time which type of comparison to use (kinda
>   DC> like the smartmatch operator does).
>
> only in the ST. the orcish does its compares on the fly without a full
> prescan. and the GRT can't prescan as it mungs and merges keys on the
> fly.

Then, naturally, if you don't provide the appropriate type information
C<sort> won't be able to select Orcish or GRT. Not that missing the Orcish
matters, since it's merely a jitted ST (which *will* still be available).

Damian

Re: The Sort Problem

Reply via email to