[go-nuts] [show] A lemmatizer for Go

2018-05-27 Thread Matt Sherman
Hi, been a while since I’ve been on the list!

I’ve started a package with tokenizers and lemmatizers for Go, called 
‘Jargon’. It’s intended to be useful for detecting synonyms in text, and 
turning them into their canonical terms.
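
To illustrate the idea of turning synonyms into canonical terms, here is a minimal sketch. It is not the Jargon API: the `canonical` map, its entries, and the `lemmatize` function are made up for illustration, and the input is assumed to be already tokenized.

```go
package main

import (
	"fmt"
	"strings"
)

// canonical maps lowercased synonyms to a canonical term.
// The entries are hypothetical examples, not Jargon's dictionary.
var canonical = map[string]string{
	"golang": "go",
	"js":     "javascript",
	"nodejs": "node.js",
}

// lemmatize replaces each known synonym with its canonical term
// and passes every other token through unchanged.
func lemmatize(tokens []string) []string {
	out := make([]string, 0, len(tokens))
	for _, t := range tokens {
		if c, ok := canonical[strings.ToLower(t)]; ok {
			out = append(out, c)
			continue
		}
		out = append(out, t)
	}
	return out
}

func main() {
	tokens := strings.Fields("I write Golang and JS")
	fmt.Println(lemmatize(tokens)) // [I write go and javascript]
}
```

A real lemmatizer would also need to handle multi-word terms and punctuation, which this sketch ignores.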

It’s early so I am looking for feedback: would you find such a package 
useful? What for?

The source & docs are here and I 
made a playground here.

Cheers,

- Matt

-- 
You received this message because you are subscribed to the Google Groups 
"golang-nuts" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to golang-nuts+unsubscr...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


[go-nuts] Re: Custom types for use with range (Go 2)

2018-08-07 Thread Matt Sherman
Sorry for late reply: yes, it’s sugar, and a first implementation might be 
to have the compiler simply rewrite it like a macro, as in your example.

And I realize that my example was more verbose than need be. We don’t call 
an iterator on arrays, maps, etc, so my example should have been:

for t := range tokenizer { 
   // etc 
}

I.e., no need to call .Range(), since the point of the ‘interface’ is to let 
the compiler infer how to iterate.

It’s quite a lot less boilerplate, while keeping the intent clear, and 
maybe even preventing some classes of user error.
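
For reference, the explicit loop that the proposed `for t := range tokenizer` would stand for can be sketched as below. The `tokenizer` type and its `Next` method are hypothetical stand-ins; the loop shape is the one from the quoted message.

```go
package main

import "fmt"

// tokenizer is a hypothetical iterator over a fixed list of tokens.
type tokenizer struct {
	tokens []string
	pos    int
}

// Next returns the next token and true, or "" and false once exhausted.
func (t *tokenizer) Next() (string, bool) {
	if t.pos >= len(t.tokens) {
		return "", false
	}
	tok := t.tokens[t.pos]
	t.pos++
	return tok, true
}

// collect drains the tokenizer with the hand-written loop that the
// proposed range sugar would replace.
func collect(tk *tokenizer) []string {
	var out []string
	for tok, more := tk.Next(); more; tok, more = tk.Next() {
		out = append(out, tok)
	}
	return out
}

func main() {
	tk := &tokenizer{tokens: []string{"a", "b", "c"}}
	fmt.Println(collect(tk)) // [a b c]
}
```

The boilerplate being saved is the repeated `tk.Next()` call in the init and post statements, which is also where a mismatch is easy to introduce by hand.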


On Friday, July 20, 2018 at 9:24:18 AM UTC-4, Juliusz Chroboczek wrote:
>
> > for t := range tokenizer.Next() { 
> > // etc 
> > } 
>
> Isn't that just syntactic sugar for 
>
> for t, more := f(); more; t, more = f() { 
> ... 
> } 
>
> ? 
>
>
>



[go-nuts] Generics as builtin typeclasses

2018-09-04 Thread Matt Sherman
Here’s a riff on generics focused on builtin typeclasses (instead of user 
contracts): https://clipperhouse.com/go-generics-typeclasses/

Feedback welcome.



[go-nuts] Re: Generics as builtin typeclasses

2018-09-04 Thread Matt Sherman
@Jon mostly to keep the syntactic spirit of the current proposal. I don’t 
have strong opinions on that.
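
For comparison, here is how the `Sum` signature under discussion reads in the generics syntax that Go adopted in 1.18. This is a sketch: `Numeric` is a hand-written constraint standing in for the `numeric` typeclass from the post, not a predeclared identifier.

```go
package main

import "fmt"

// Numeric is a hand-written constraint standing in for the proposed
// builtin `numeric` typeclass; Go does not predeclare it.
type Numeric interface {
	~int | ~int8 | ~int16 | ~int32 | ~int64 |
		~uint | ~uint8 | ~uint16 | ~uint32 | ~uint64 | ~uintptr |
		~float32 | ~float64
}

// Sum adds up the elements of x, for any element type that satisfies Numeric.
func Sum[T Numeric](x []T) T {
	var total T
	for _, v := range x {
		total += v
	}
	return total
}

func main() {
	fmt.Println(Sum([]int{1, 2, 3}))      // 6
	fmt.Println(Sum([]float64{0.5, 1.5})) // 2
}
```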

On Tuesday, September 4, 2018 at 3:56:26 PM UTC-4, Jon Conradt wrote:
>
> Why:
>
> func Sum(type T numeric)(x []T) T {
>
>
> and not just
>
> func Sum(x []T type numeric) T {
>
>
> or
>
> func Sum(x []T Numeric) T {
>
>
>
> Jon
>
> On Tuesday, September 4, 2018 at 11:57:02 AM UTC-7, Matt Sherman wrote:
>>
>> Here’s a riff on generics focused on builtin typeclasses (instead of user 
>> contracts): https://clipperhouse.com/go-generics-typeclasses/
>>
>> Feedback welcome.
>>
>



Re: [go-nuts] Generics as builtin typeclasses

2018-09-04 Thread Matt Sherman
@Matthias I don’t mention it in my post but I think that’d be fine, e.g.:

  type Set(type T comparable) []T
  type OrderedSlice(type T orderable) []T
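
As a sketch of what these two container types look like in the syntax Go later shipped (assuming Go 1.21 or newer for the `cmp` and `slices` packages): `comparable` is a predeclared constraint, and `cmp.Ordered` plays the role of `orderable`. The `Add`, `Has`, and `Sorted` methods are added here for illustration only, and `Set` is backed by a map rather than the slice in the message.

```go
package main

import (
	"cmp"
	"fmt"
	"slices"
)

// Set holds unique values of any comparable type.
type Set[T comparable] map[T]struct{}

// Add inserts v into the set.
func (s Set[T]) Add(v T) { s[v] = struct{}{} }

// Has reports whether v is in the set.
func (s Set[T]) Has(v T) bool {
	_, ok := s[v]
	return ok
}

// OrderedSlice is a slice of any type that supports the < operator.
type OrderedSlice[T cmp.Ordered] []T

// Sorted returns a sorted copy, leaving the receiver untouched.
func (o OrderedSlice[T]) Sorted() OrderedSlice[T] {
	out := slices.Clone(o)
	slices.Sort(out)
	return out
}

func main() {
	s := Set[string]{}
	s.Add("go")
	fmt.Println(s.Has("go"), s.Has("rust")) // true false

	fmt.Println(OrderedSlice[int]{3, 1, 2}.Sorted()) // [1 2 3]
}
```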


On Tuesday, September 4, 2018 at 3:52:50 PM UTC-4, Matthias B. wrote:
>
> On Tue, 4 Sep 2018 11:57:02 -0700 (PDT) 
> Matt Sherman wrote: 
>
> > Here’s a riff on generics focused on builtin typeclasses (instead of 
> > user contracts): https://clipperhouse.com/go-generics-typeclasses/ 
> > 
> > Feedback welcome. 
> > 
>
> The main motivation behind generics has always been type-safe 
> containers of custom types. I'm not seeing this in your proposal. 
>
> MSB 
>
>
>



Re: [go-nuts] Re: Generics as builtin typeclasses

2018-09-06 Thread Matt Sherman
Thanks Ian, was hoping you’d weigh in.

Perhaps a compromise position would be that these type 
groups/classes/contracts are not language builtins but in the stdlib? 
contracts.Comparable, etc. If we really don’t want to see the dot, we 
import with _.

And, for a first implementation, only the stdlib can define contracts. We 
might find that to be ‘good enough’ (a positive outcome).

The hypothesis (just to reiterate) is that there is a small number of 
contracts that cover the majority of use cases, and so we can exploit that 
fact to minimize new language concepts. 80% benefit for 20% complexity, 
which is how I describe Go’s current type system.


On Thursday, September 6, 2018 at 12:54:24 PM UTC-4, Ian Lance Taylor wrote:
>
> On Thu, Sep 6, 2018 at 8:20 AM,  > wrote: 
> > 
> > As I wasn't happy with some aspects of it, I've rewritten my feedback on the 
> > Go 2 Generics draft and deleted the original gist. Here's the link to the 
> > new gist for anybody who's interested: 
> > https://gist.github.com/alanfo/5da5932c7b60fd130a928ebbace1f251 
> > 
> > This is still based on the type-class idea though I'm now proposing a 
> > simplified contracts approach to go with it rather than trying to make 
> > interfaces fit. It seems to deal easily now with all the examples in the 
> > draft paper though no doubt there will be stuff that it can't do or that 
> > I've overlooked. 
>
> Thanks for writing this.  I never got around to reading the earlier 
> feedback.  And thanks for working out the examples. 
>
> Personally I think an important feature of the current design draft is 
> that it adds relatively few new concepts to the language.  While 
> concepts are of course a new feature, a contract looks like a 
> function.  If you can read a function, you can read a contract.  You 
> don't need to understand a new set of ideas to know what a contract 
> is.  With your proposal, everybody has to learn a new set of 
> predeclared identifiers.  You list 14 new ones, including $struct.  I 
> count 39 existing predeclared identifiers, so this is a significant 
> increase.  Also, of course, the new identifiers don't look like any 
> existing ones, with the $, but perhaps that could be changed.  I would 
> very much prefer to not add so many new names. 
>
> If new features are added to the language, your approach may require 
> new predeclared identifiers, whereas the contract approach will 
> automatically adjust. 
>
> It's worth noting that your suggestion is less powerful than the 
> design draft, in that you can't express the notion of type parameter 
> that must be a channel type or a slice type.  This may not matter very 
> much, because the generic function can always write chan T or []T. 
>
>
> > It looks to me as though, if it requires a contract at all, you might 
> > end up writing one from scratch for most generic functions/types you need. 
> > Even if the commoner ones could be included in a 'contracts' package in the 
> > standard library, you'd still need to import that package and then write 
> > stuff like 'contracts.Comparable' which is a bit verbose. 
> > 
> > Even if the present design proves workable, I think writing contracts may 
> > prove a bit of a black art and that, if things are at all complicated, some 
> > programmers may just give it up and embed the function's code in the 
> > contract which defeats the object of having them in the first place! 
>
> I believe we can use tooling to make these operations easier.  For 
> example, assuming we can make contracts work at all, it should be 
> straightforward to write a tool that can minimize a contract given an 
> existing contract definition, and therefore can produce a minimal 
> contract for an existing function body. 
>
> Note that while I think it's important that there be some way to 
> express complex contracts, I think they will be used quite rarely. 
>
>
> > The more I look at this, the more complicated it seems to get :( 
>
> Yes. 
>
> Ian 
>



[go-nuts] x/text: Interest in Unicode text segmentation?

2020-04-15 Thread Matt Sherman
Hi, I am working on a tokenizer based on Unicode text segmentation (UAX 29). 
I am wondering if there would be an interest in adding range tables for word 
break categories to the x/text or unicode packages. It appears they could be 
code-gen’d alongside the rest of the range tables.

Pardon if this is already being done and I have missed it. I see some 
mention of those categories (e.g. ALetter) in other places.

My code is here. Thanks.

-- 
To view this discussion on the web visit 
https://groups.google.com/d/msgid/golang-nuts/2a058556-da51-46d0-a41b-28e323541332%40googlegroups.com.


Re: [go-nuts] x/text: Interest in Unicode text segmentation?

2020-04-16 Thread Matt Sherman
Great. Yes, the data files are here:
https://unicode.org/reports/tr41/tr41-26.html#Props0

I’ve done a proof of concept here: https://github.com/clipperhouse/uax29

To do it properly, I assume we’d want to use the house style here?
https://github.com/golang/text/blob/master/unicode/rangetable/gen.go

On Thu, Apr 16, 2020 at 1:52 PM  wrote:

> Yes that would be interesting. Especially if it can be generated from the
> Unicode raw data upon updates.
>
> On Wed, 15 Apr 2020 at 23:56 Ian Lance Taylor  wrote:
>
>> [ +mpvl ]
>>
>> On Wed, Apr 15, 2020 at 2:30 PM Matt Sherman  wrote:
>> >
>> > Hi, I am working on a tokenizer based on Unicode text segmentation (UAX
>> 29). I am wondering if there would be an interest in adding range tables
>> for word break categories to the x/text or unicode packages. It appears
>> they could be code-gen’d alongside the rest of the range tables.
>> >
>> > Pardon if this is already being done and I have missed it. I see some
>> mention of those categories (e.g. ALetter) in other places.
>> >
>> > My code is here. Thanks.
>> >
>>
>

-- 
To view this discussion on the web visit 
https://groups.google.com/d/msgid/golang-nuts/CAMPnbukLN%3DSVkhBQ1TM8TYfp-t1Z3Wxc6MuAi6UZFYYnumU3rw%40mail.gmail.com.


Re: [go-nuts] x/text: Interest in Unicode text segmentation?

2020-04-17 Thread Matt Sherman
Nice. Well, happy to discuss how I might be helpful — implementation, API
design, etc.

For the work I’m doing on UAX 29, the key API is unicode.Is. I am satisfied
with the perf so far. unicode.Is dominates the profiling, but that’s to be
expected, as my scanner is basically a tight loop evaluating rune
categories. Certainly open to using a different trie-driven API.
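
A minimal sketch of that kind of loop, using only the standard `unicode` package. It is not the UAX 29 algorithm: the `apostrophes` table is a hypothetical hand-written range table (real word-break tables would be generated from the Unicode data files), and the `words` function simply splits on anything that is not a letter, digit, or apostrophe.

```go
package main

import (
	"fmt"
	"unicode"
)

// apostrophes is a hypothetical hand-written range table; it stands in
// for a generated word-break category table.
var apostrophes = &unicode.RangeTable{
	R16: []unicode.Range16{
		{Lo: 0x0027, Hi: 0x0027, Stride: 1}, // apostrophe
		{Lo: 0x2019, Hi: 0x2019, Stride: 1}, // right single quotation mark
	},
	LatinOffset: 1, // one R16 entry lies within Latin-1
}

// isWordRune evaluates the rune against each category table in turn.
func isWordRune(r rune) bool {
	return unicode.Is(unicode.Letter, r) ||
		unicode.Is(unicode.Digit, r) ||
		unicode.Is(apostrophes, r)
}

// words returns the maximal runs of word runes in text.
func words(text string) []string {
	var out []string
	start := -1 // byte offset of the current run, or -1 if none
	for i, r := range text {
		if isWordRune(r) {
			if start < 0 {
				start = i
			}
			continue
		}
		if start >= 0 {
			out = append(out, text[start:i])
			start = -1
		}
	}
	if start >= 0 {
		out = append(out, text[start:])
	}
	return out
}

func main() {
	fmt.Printf("%q\n", words("Don’t panic, 42!")) // ["Don’t" "panic" "42"]
}
```

Every rune goes through one or more `unicode.Is` calls, which is why that function would be expected to dominate a profile of a scanner shaped like this.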

On Fri, Apr 17, 2020 at 1:47 AM  wrote:

> Most of the x/text packages use tries and not rangetables. These allow
> arbitrary data (as long as it fits in an int) to be associated with runes
> and allow operating on utf8 without having to convert to runes.
> https://godoc.org/golang.org/x/text/internal/triegen. But that’s not a
> requirement.
>
> The package
> https://godoc.org/golang.org/x/text/internal/gen/bitfield converts Go
> structs to ints and can be used to pack the rune data in a convenient way.
>
> Furthermore Package
> https://godoc.org/golang.org/x/text/internal/ucd
> can be used for reading UCD files
>
> And Package
> https://godoc.org/golang.org/x/text/internal/gen
> can be used to generate Go tables other than the trie and includes
> utilities to generate canonical x/text files, such as including the Unicode
> and CLDR versions.
>
> The top-level file gen.go is used to orchestrate building x/text and
> captures dependencies between packages.
>
> I may have some designs laying around for the API.
>
> On Thu, 16 Apr 2020 at 21:46 Matt Sherman  wrote:
>
>> Great. Yes, the data files are here:
>> https://unicode.org/reports/tr41/tr41-26.html#Props0
>>
>> I’ve done a proof of concept here: https://github.com/clipperhouse/uax29
>>
>> To do it properly, I assume we’d want to use the house style here?
>> https://github.com/golang/text/blob/master/unicode/rangetable/gen.go
>>
>> On Thu, Apr 16, 2020 at 1:52 PM  wrote:
>>
>>> Yes that would be interesting. Especially if it can be generated from
>>> the Unicode raw data upon updates.
>>>
>>> On Wed, 15 Apr 2020 at 23:56 Ian Lance Taylor  wrote:
>>>
>>>> [ +mpvl ]
>>>>
>>>> On Wed, Apr 15, 2020 at 2:30 PM Matt Sherman 
>>>> wrote:
>>>> >
>>>> > Hi, I am working on a tokenizer based on Unicode text segmentation
>>>> (UAX 29). I am wondering if there would be an interest in adding range
>>>> tables for word break categories to the x/text or unicode packages. It
>>>> appears they could be code-gen’d alongside the rest of the range tables.
>>>> >
>>>> > Pardon if this is already being done and I have missed it. I see some
>>>> mention of those categories (e.g. ALetter) in other places.
>>>> >
>>>> > My code is here. Thanks.
>>>> >
>>>>
>>>

-- 
To view this discussion on the web visit 
https://groups.google.com/d/msgid/golang-nuts/CAMPnbukOfdaV_D9P1cChmWrN%2BT1kf2OSOAgyXmRf-3PBakbOSw%40mail.gmail.com.


[go-nuts] [ANN] Unicode text segmentation

2020-05-07 Thread Matt Sherman
Hi gophers, I’ve implemented Unicode text segmentation for 
Go: https://github.com/clipperhouse/uax29/words

It tokenizes text into words, sentences or graphemes according to the Unicode 
spec <https://unicode.org/reports/tr29/>. I’d been tokenizing text in ad 
hoc ways, and then learned that there is a Unicode standard.

Hopefully useful for you, feedback welcome. (I’m also talking to @mpvl 
about how such functionality might be useful in x/text.)

-- 
To view this discussion on the web visit 
https://groups.google.com/d/msgid/golang-nuts/69cec677-b512-45a1-a9d6-592302d878ae%40googlegroups.com.


[go-nuts] Re: [ANN] Unicode text segmentation

2020-05-07 Thread Matt Sherman
Sorry, bad link. Here it is: https://github.com/clipperhouse/uax29

On Thursday, May 7, 2020 at 12:06:18 PM UTC-4, Matt Sherman wrote:
>
> Hi gophers, I’ve implemented Unicode text segmentation for Go: 
> https://github.com/clipperhouse/uax29/words
>
> It tokenizes text into words, sentences or graphemes according to the Unicode 
> spec <https://unicode.org/reports/tr29/>. I’d been tokenizing text in ad 
> hoc ways, and then learned that there is a Unicode standard.
>
> Hopefully useful for you, feedback welcome. (I’m also talking to @mpvl 
> about how such functionality might be useful in x/text.)
>

-- 
To view this discussion on the web visit 
https://groups.google.com/d/msgid/golang-nuts/bbb890f3-ee1c-41f3-8468-d90b971b1977%40googlegroups.com.


[go-nuts] [unicode] Missing Katakana runes in rangetable?

2022-06-27 Thread Matt Sherman
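Hi there, I stumbled across a surprising discovery that 
unicode.Is(unicode.Katakana, 'ー') returns false. This is code point U+30FC, 
and appears in the Katakana code block 
<https://www.compart.com/en/unicode/block/U+30A0>. Looking at the 
rangetable 
<https://github.com/golang/go/blob/release-branch.go1.18/src/unicode/tables.go#L4709-L4723>, 
it appears to be skipped, along with 30FB, if I am reading correctly.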

Would anyone know if this is intentional? I recognize that these tables are 
generated, though I admit I could not find the generator for Scripts 
categorization.

Thanks.

-- 
To view this discussion on the web visit 
https://groups.google.com/d/msgid/golang-nuts/30e3a49b-4a9f-4539-8150-6b2d3e254a20n%40googlegroups.com.


[go-nuts] Re: [unicode] Missing Katakana runes in rangetable?

2022-06-27 Thread Matt Sherman
Ah, I was barking up the wrong tree on this, please disregard. It’s an 
extending character, which by itself (I infer) is not categorized as 
Katakana.
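
A small check of this with the standard `unicode` package is sketched below. The `describe` helper is made up for illustration. It relies on the Unicode data assigning U+30FC to the Common script and giving it the Extender property, both of which Go exposes as range tables.

```go
package main

import (
	"fmt"
	"unicode"
)

// describe reports whether r is in the Katakana script table, the
// Common script table, and the Extender property table.
func describe(r rune) (katakana, common, extender bool) {
	return unicode.Is(unicode.Katakana, r),
		unicode.Is(unicode.Common, r),
		unicode.Is(unicode.Extender, r)
}

func main() {
	for _, r := range []rune{'\u30AB', '\u30FC'} { // カ, then the prolonged sound mark ー
		k, c, e := describe(r)
		fmt.Printf("%U katakana=%t common=%t extender=%t\n", r, k, c, e)
	}
}
```

A scanner that wants to treat the mark as part of a Katakana word would need to consult script extensions or the neighbouring runes, since the Scripts table alone will not say so.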

On Monday, June 27, 2022 at 10:51:07 PM UTC-4 Matt Sherman wrote:

> Hi there, I stumbled across a surprising discovery that 
> unicode.Is(unicode.Katakana, 'ー') returns false. This is code point U+30FC, 
> and appears in the Katakana code block 
> <https://www.compart.com/en/unicode/block/U+30A0>. Looking at the 
> rangetable 
> <https://github.com/golang/go/blob/release-branch.go1.18/src/unicode/tables.go#L4709-L4723>,
>  
> it appears to be skipped, along with 30FB, if I am reading correctly.
>
> Would anyone know if this is intentional? I recognize that these tables 
> are generated, though I admit I could not find the generator for Scripts 
> categorization.
>
> Thanks.
>
>

-- 
To view this discussion on the web visit 
https://groups.google.com/d/msgid/golang-nuts/1c5c2f3b-8915-4fb2-877d-6b57b9913a61n%40googlegroups.com.