Re: [Factor-talk] Dedupe by Slot
The *.extras vocabularies are places to incubate new words. We haven't done the best job of documenting them and promoting them to the core/basis vocabularies.

> On Nov 18, 2016, at 8:21 AM, Alexander Ilin wrote:
>
> Hello, Björn!
>
> 18.11.2016, 18:25, "Björn Lindqvist":
>> USE: sequences.extras
>> [ id>> ] sort-with [ id>> ] group-by [ second first ] map
>
> I could not find `group-by` using the Browser. Grepping the source tree, it
> turned up in `grouping.extras`.
>
>> USE: math.statistics
>> [ id>> ] collect-by [ nip first ] { } assoc>map
>
> `collect-by` is a useful thing, got to keep it in mind. I remember
> implementing something very similar not too long ago.
>
>> It's not as efficient as what John committed though. :) Maybe we
>> should try and clean it up somehow? If we put all group
>> by/aggregation/uniquifying words in the same vocab it would be more
>> easily discoverable?
>
> That may be a good idea. I'm regularly rereading the documentation for
> `sequences` and `sets`, so maybe a little pointer to adjacent vocabs (at
> least *.extras) could be made.
>
> ---=---
> Александр

--
___
Factor-talk mailing list
Factor-talk@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/factor-talk
Re: [Factor-talk] Dedupe by Slot
Hello, Björn!

18.11.2016, 18:25, "Björn Lindqvist":
> USE: sequences.extras
> [ id>> ] sort-with [ id>> ] group-by [ second first ] map

I could not find `group-by` using the Browser. Grepping the source tree, it turned up in `grouping.extras`.

> USE: math.statistics
> [ id>> ] collect-by [ nip first ] { } assoc>map

`collect-by` is a useful thing, got to keep it in mind. I remember implementing something very similar not too long ago.

> It's not as efficient as what John committed though. :) Maybe we
> should try and clean it up somehow? If we put all group
> by/aggregation/uniquifying words in the same vocab it would be more
> easily discoverable?

That may be a good idea. I'm regularly rereading the documentation for `sequences` and `sets`, so maybe a little pointer to adjacent vocabs (at least *.extras) could be made.

---=---
Александр
Re: [Factor-talk] Dedupe by Slot
John, thank you very much! : ) Really helpful stuff!

18.11.2016, 18:13, "John Benediktsson":
> P.S., Hah I should have called it unique-by, it's too early in the morning!
>
> P.P.S., I committed this word into sets.extras, with one small change
> besides the name, which is to size the hash-set capacity by the length of
> the sequence.
>
> On Fri, Nov 18, 2016 at 6:54 AM, John Benediktsson wrote:
>> Maybe something like this:
>>
>>     : duplicates-by ( seq quot: ( elt -- key ) -- seq' )
>>         HS{ } clone '[ @ _ ?adjoin ] filter ; inline
>>
>> Then you can use it:
>>
>>     IN: scratchpad { 1 2 3 4 5 } [ 2/ ] duplicates-by
>>     { 1 2 4 }
>>
>>     IN: scratchpad sequence-of-tuples [ hash>> ] duplicates-by
>>
>> It would keep the first element that matches by key and drop all the
>> subsequent ones.
>>
>> On Fri, Nov 18, 2016 at 6:36 AM, Alexander Ilin wrote:
>>> Hello, all!
>>>
>>> I have an interesting little task for you today.
>>>
>>> Let's say you have a sequence of tuples, and you want to remove all
>>> tuples with duplicate ids, so that in the new sequence there is only one
>>> tuple with each id.
>>>
>>> Here's my solution:
>>>
>>> TYPED: dedupe-by-hash ( seq: sequence -- seq: sequence )
>>>     dup [ hash>> ] map >hash-set [
>>>         [ hash>> ] dip
>>>         [ in? ] [ delete ] 2bi
>>>     ] curry filter ;
>>>
>>> This is not the first time I'm solving this task, and I began to wonder -
>>> is there something similar in the Factor library?
>>>
>>> Is this the simplest/most efficient implementation?
>>>
>>> Is it possible to generalize it to work for any slot like so:
>>>
>>> TYPED: dedupe-by-slot ( seq slot -- seq ) ?
>>>
>>> If this code is not in the standard library, how about adding it? Seems
>>> pretty useful, and not too trivial.
>>>
>>> What do you say?
>>>
>>> ---=---
>>> Александр

---=---
Александр
Re: [Factor-talk] Dedupe by Slot
2016-11-18 15:36 GMT+01:00 Alexander Ilin:
> Hello, all!
>
> I have an interesting little task for you today.
>
> Let's say you have a sequence of tuples, and you want to remove all tuples
> with duplicate ids, so that in the new sequence there is only one tuple with
> each id.
>
> Here's my solution:
>
> TYPED: dedupe-by-hash ( seq: sequence -- seq: sequence )
>     dup [ hash>> ] map >hash-set [
>         [ hash>> ] dip
>         [ in? ] [ delete ] 2bi
>     ] curry filter ;
>
> This is not the first time I'm solving this task, and I began to wonder -
> is there something similar in the Factor library?

Everything is in the Factor library. :) What you are describing is like a group by operation in SQL. So if you have:

    TUPLE: person name id ;

You can use either:

    USE: sequences.extras
    [ id>> ] sort-with [ id>> ] group-by [ second first ] map

Or

    USE: math.statistics
    [ id>> ] collect-by [ nip first ] { } assoc>map

If you want tiebreakers, like choosing the person with the alphabetically first name if more than one share an id, you can implement it like this:

    USE: slots.syntax
    [ slots{ id name } ] sort-with [ id>> ] group-by [ second first ] map

It's not as efficient as what John committed though. :) Maybe we should try and clean it up somehow? If we put all group by/aggregation/uniquifying words in the same vocab it would be more easily discoverable?

--
mvh Björn Lindqvist
Re: [Factor-talk] Dedupe by Slot
Maybe something like this:

    : duplicates-by ( seq quot: ( elt -- key ) -- seq' )
        HS{ } clone '[ @ _ ?adjoin ] filter ; inline

Then you can use it:

    IN: scratchpad { 1 2 3 4 5 } [ 2/ ] duplicates-by
    { 1 2 4 }

    IN: scratchpad sequence-of-tuples [ hash>> ] duplicates-by

It would keep the first element that matches by key and drop all the subsequent ones.

On Fri, Nov 18, 2016 at 6:36 AM, Alexander Ilin wrote:
> Hello, all!
>
> I have an interesting little task for you today.
>
> Let's say you have a sequence of tuples, and you want to remove all
> tuples with duplicate ids, so that in the new sequence there is only one
> tuple with each id.
>
> Here's my solution:
>
> TYPED: dedupe-by-hash ( seq: sequence -- seq: sequence )
>     dup [ hash>> ] map >hash-set [
>         [ hash>> ] dip
>         [ in? ] [ delete ] 2bi
>     ] curry filter ;
>
> This is not the first time I'm solving this task, and I began to wonder
> - is there something similar in the Factor library?
>
> Is this the simplest/most efficient implementation?
>
> Is it possible to generalize it to work for any slot like so:
>
> TYPED: dedupe-by-slot ( seq slot -- seq ) ?
>
> If this code is not in the standard library, how about adding it? Seems
> pretty useful, and not too trivial.
>
> What do you say?
>
> ---=---
> Александр
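
For anyone who wants to try the hash-set approach on tuples, here is a minimal self-contained sketch. The vocabulary name `dedupe-sketch`, the word name `keep-first-by`, and the sample `person` data are hypothetical and chosen only for illustration. Sizing the hash-set by the sequence length follows the change described in the P.P.S. elsewhere in this thread, but this is not necessarily the exact code that was committed to sets.extras.

```factor
! Sketch only: the vocabulary, word, and sample data are hypothetical.
USING: accessors fry hash-sets kernel prettyprint sequences sets ;
IN: dedupe-sketch

TUPLE: person name id ;

! Keep the first element seen for each key and drop later
! elements whose key has already been seen.
: keep-first-by ( seq quot: ( elt -- key ) -- seq' )
    ! One fresh hash-set per call, pre-sized to the input length.
    over length <hash-set>
    ! ?adjoin returns t only when the key was not yet in the set.
    '[ @ _ ?adjoin ] filter ; inline

{
    T{ person f "Ann" 1 }
    T{ person f "Bob" 2 }
    T{ person f "Cid" 1 }
} [ id>> ] keep-first-by [ name>> ] map .
! prints { "Ann" "Bob" }
```

Because the key is computed by a quotation, passing a different accessor (for example `[ hash>> ]`) covers the "dedupe by any slot" case from the original question without a separate word per slot.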