Re: [Factor-talk] Dedupe by Slot

2016-11-18 Thread John Benediktsson

The *.extras vocabularies are places to incubate new words. We haven't done the 
best job of documenting them or of promoting them to the core/basis vocabularies. 

> On Nov 18, 2016, at 8:21 AM, Alexander Ilin  wrote:
> 
> Hello, Björn!
> 
> 18.11.2016, 18:25, "Björn Lindqvist" :
>> USE: sequences.extras
>> [ id>> ] sort-with [ id>> ] group-by [ second first ] map
> 
>  I could not find `group-by` using the Browser. When I grepped the source 
> tree, it turned up in `grouping.extras`.
> 
>> USE: math.statistics
>> [ id>> ] collect-by [ nip first ] { } assoc>map
> 
>  `collect-by` is a useful thing, got to keep it in mind. I remember 
> implementing something very similar not too long ago.
> 
>> It's not as efficient as what John committed though. :) Maybe we
>> should try and clean it up somehow? If we put all group
>> by/aggregation/uniquifying words in the same vocab it would be more
>> easily discoverable?
> 
>  That may be a good idea. I'm regularly rereading the documentation for 
> `sequences` and `sets`, so maybe a little pointer to adjacent vocabs (at 
> least *.extras) could be made.
> 
> ---=--- 
> Александр
> 
> --
> ___
> Factor-talk mailing list
> Factor-talk@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/factor-talk

Re: [Factor-talk] Dedupe by Slot

2016-11-18 Thread Alexander Ilin

Hello, Björn!

18.11.2016, 18:25, "Björn Lindqvist" :
> USE: sequences.extras
> [ id>> ] sort-with [ id>> ] group-by [ second first ] map

  I could not find `group-by` using the Browser. When I grepped the source 
tree, it turned up in `grouping.extras`.

> USE: math.statistics
> [ id>> ] collect-by [ nip first ] { } assoc>map

  `collect-by` is a useful thing, got to keep it in mind. I remember 
implementing something very similar not too long ago.

> It's not as efficient as what John committed though. :) Maybe we
> should try and clean it up somehow? If we put all group
> by/aggregation/uniquifying words in the same vocab it would be more
> easily discoverable?

  That may be a good idea. I'm regularly rereading the documentation for 
`sequences` and `sets`, so maybe a little pointer to adjacent vocabs (at least 
*.extras) could be made.

---=--- 
 Александр


Re: [Factor-talk] Dedupe by Slot

2016-11-18 Thread Alexander Ilin

John, thank you very much! : )
Really helpful stuff!

18.11.2016, 18:13, "John Benediktsson" :
> P.S., Hah I should have called it unique-by, it's too early in the morning!
>
> P.P.S., I committed this word into sets.extras, with one small change
> besides the name which is to size the hash-set capacity by the length of
> the sequence.
>
> On Fri, Nov 18, 2016 at 6:54 AM, John Benediktsson  wrote:
>> Maybe something like this:
>>
>> : duplicates-by ( seq quot: ( elt -- key ) -- seq' )
>>     HS{ } clone '[ @ _ ?adjoin ] filter ; inline
>>
>> Then you can use it:
>>
>>     IN: scratchpad { 1 2 3 4 5 } [ 2/ ] duplicates-by
>>     { 1 2 4 }
>>
>>     IN: scratchpad sequence-of-tuples [ hash>> ] duplicates-by
>>
>> It would keep the first element that matches by key and drop all the
>> subsequent ones.
>>
>> On Fri, Nov 18, 2016 at 6:36 AM, Alexander Ilin  wrote:
>>> Hello, all!
>>>
>>>   I have an interesting little task for you today.
>>>
>>>   Let's say you have a sequence of tuples, and you want to remove all
>>> tuples with duplicate ids, so that in the new sequence there is only one
>>> tuple with each id.
>>>
>>>   Here's my solution:
>>>
>>> TYPED: dedupe-by-hash ( seq: sequence -- seq: sequence )
>>>     dup [ hash>> ] map >hash-set [
>>>         [ hash>> ] dip
>>>         [ in? ] [ delete ] 2bi
>>>     ] curry filter ;
>>>
>>>   This is not the first time I'm solving this task, and I began to wonder
>>> - is there something similar in the Factor library?
>>>
>>>   Is this the simplest/most efficient implementation?
>>>
>>>   Is it possible to generalize it to work for any slot like so:
>>>
>>> TYPED: dedupe-by-slot ( seq slot -- seq ) ?
>>>
>>>   If this code is not in the standard library, how about adding it? Seems
>>> pretty useful, and not too trivial.
>>>
>>>   What do you say?
>>>
>>> ---=---
>>> Александр

---=---
Александр
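
The P.P.S. above describes the committed word only in words. A minimal sketch of what that change might look like, assuming `<hash-set>` from the `hash-sets` vocabulary takes an initial capacity. This is a guess at the shape of the word, not the code that was actually committed to sets.extras:

```factor
USING: fry hash-sets kernel math prettyprint sequences sets ;
IN: scratchpad

! Hypothetical reconstruction: same filter as duplicates-by, but the
! hash-set is pre-sized to the length of the input sequence instead
! of starting from an empty HS{ } literal.
: unique-by ( seq quot: ( elt -- key ) -- seq' )
    over length <hash-set> '[ @ _ ?adjoin ] filter ; inline

! Same example as in the quoted message; prints { 1 2 4 }
{ 1 2 3 4 5 } [ 2/ ] unique-by .
```

Pre-sizing only avoids growing the set while filtering; which elements are kept does not change.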

Re: [Factor-talk] Dedupe by Slot

2016-11-18 Thread Björn Lindqvist

2016-11-18 15:36 GMT+01:00 Alexander Ilin :
> Hello, all!
>
>   I have an interesting little task for you today.
>
>   Let's say you have a sequence of tuples, and you want to remove all tuples 
> with duplicate ids, so that in the new sequence there is only one tuple with 
> each id.
>
>   Here's my solution:
>
> TYPED: dedupe-by-hash ( seq: sequence -- seq: sequence )
>     dup [ hash>> ] map >hash-set [
>         [ hash>> ] dip
>         [ in? ] [ delete ] 2bi
>     ] curry filter ;
>
>   This is not the first time I'm solving this task, and I began to wonder - 
> is there something similar in the Factor library?

Everything is in the Factor library. :) What you are describing is
like a GROUP BY operation in SQL. So if you have:

TUPLE: person name id ;

You can use either:

USE: sequences.extras
[ id>> ] sort-with [ id>> ] group-by [ second first ] map

Or

USE: math.statistics
[ id>> ] collect-by [ nip first ] { } assoc>map

If you want tiebreakers, like choosing the person with the
alphabetically first name when more than one shares an id, you can
implement it like this:

USE: slots.syntax
[ slots{ id name } ] sort-with [ id>> ] group-by [ second first ] map

It's not as efficient as what John committed though. :) Maybe we
should try and clean it up somehow? If we put all group
by/aggregation/uniquifying words in the same vocab it would be more
easily discoverable?


--
mvh Björn Lindqvist


Re: [Factor-talk] Dedupe by Slot

2016-11-18 Thread John Benediktsson

Maybe something like this:

: duplicates-by ( seq quot: ( elt -- key ) -- seq' )
    HS{ } clone '[ @ _ ?adjoin ] filter ; inline

Then you can use it:

IN: scratchpad { 1 2 3 4 5 } [ 2/ ] duplicates-by
{ 1 2 4 }

IN: scratchpad sequence-of-tuples [ hash>> ] duplicates-by

It would keep the first element that matches by key and drop all the
subsequent ones.
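
To make the keep-first behaviour concrete on tuples, here is a small self-contained sketch. The `person` tuple matches the one used elsewhere in this thread; the `<person>` constructor, the `sample-people` word and its data are made up for illustration:

```factor
USING: accessors arrays fry hash-sets kernel prettyprint sequences sets ;
IN: scratchpad

TUPLE: person name id ;
C: <person> person

! Same definition as above, repeated so the snippet loads on its own.
: duplicates-by ( seq quot: ( elt -- key ) -- seq' )
    HS{ } clone '[ @ _ ?adjoin ] filter ; inline

! Hypothetical sample data: "Ann" and "Cat" share id 1.
: sample-people ( -- seq )
    "Ann" 1 <person> "Bob" 2 <person> "Cat" 1 <person> 3array ;

! "Ann" is the first person seen with id 1, so "Cat" is dropped.
! Prints { "Ann" "Bob" }
sample-people [ id>> ] duplicates-by [ name>> ] map .
```

Because `filter` walks the sequence in order, the survivors keep their original relative order.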



On Fri, Nov 18, 2016 at 6:36 AM, Alexander Ilin  wrote:

> Hello, all!
>
>   I have an interesting little task for you today.
>
>   Let's say you have a sequence of tuples, and you want to remove all
> tuples with duplicate ids, so that in the new sequence there is only one
> tuple with each id.
>
>   Here's my solution:
>
> TYPED: dedupe-by-hash ( seq: sequence -- seq: sequence )
>     dup [ hash>> ] map >hash-set [
>         [ hash>> ] dip
>         [ in? ] [ delete ] 2bi
>     ] curry filter ;
>
>   This is not the first time I'm solving this task, and I began to wonder
> - is there something similar in the Factor library?
>
>   Is this the simplest/most efficient implementation?
>
>   Is it possible to generalize it to work for any slot like so:
>
> TYPED: dedupe-by-slot ( seq slot -- seq ) ?
>
>   If this code is not in the standard library, how about adding it? Seems
> pretty useful, and not too trivial.
>
>   What do you say?
>
> ---=---
>  Александр
