snowball ASCII stemmer configuration

2020-06-16 Thread Peter Eisentraut
While I was updating the snowball code, I noticed something strange. In src/backend/snowball/Makefile:
# first column is language name and also name of dictionary for not-all-ASCII
# words, second is name of dictionary for all-ASCII words
# Note order dependency: use of some other language as

Re: snowball ASCII stemmer configuration

2020-06-16 Thread Tom Lane
Peter Eisentraut writes:
> There are two cases where these two columns are not the same:
> hindi english \
> russian english \
> The second one is old; the first one I added using the second one as
> example. But I wonder what the rationale for this is. Maybe for h

Re: snowball ASCII stemmer configuration

2020-06-16 Thread Oleg Bartunov
On Tue, Jun 16, 2020 at 4:53 PM Tom Lane wrote:
> Peter Eisentraut writes:
> > There are two cases where these two columns are not the same:
> >
> > hindi english \
> > russian english \
> >
> > The second one is old; the first one I added using the second one as
> > exam

Re: snowball ASCII stemmer configuration

2020-06-16 Thread Tom Lane
I wrote:
> Peter Eisentraut writes:
>> Moreover, AFAIK, the following other languages do not use Latin-based
>> alphabets:
>> arabic arabic \
>> greek greek \
>> nepali nepali \
>> tamil tamil \
> Hmm. I think all of those entries are ones that got a
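
The two-column table under discussion can be modelled as a small lookup, which may make the question easier to follow. The sketch below is only an illustration of the rule stated in the Makefile comment (first column for not-all-ASCII words, second column for all-ASCII words); it is not PostgreSQL's code. The names `ASCII_LANGUAGE` and `stemmer_for` are hypothetical, and only the rows quoted in this thread are included.

```python
# Hypothetical model of the Makefile table quoted in this thread.
# Key: language name (stemmer for not-all-ASCII words).
# Value: stemmer named in the second column (used for all-ASCII words).
ASCII_LANGUAGE = {
    "hindi": "english",
    "russian": "english",
    "arabic": "arabic",
    "greek": "greek",
    "nepali": "nepali",
    "tamil": "tamil",
}


def stemmer_for(language: str, word: str) -> str:
    """Return the name of the stemmer the table would select for `word`."""
    if word.isascii():
        # All-ASCII word: use the second column.
        return ASCII_LANGUAGE[language]
    # Word contains at least one non-ASCII character: use the first column.
    return language


if __name__ == "__main__":
    for language, word in [
        ("russian", "postgres"),
        ("russian", "слова"),
        ("greek", "postgres"),
    ]:
        print(language, word, "->", stemmer_for(language, word))
```

Under this model an all-ASCII word in a Russian configuration goes to the English stemmer, while the same word in a Greek configuration goes to the Greek stemmer, which is the inconsistency the thread is questioning.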

Re: snowball ASCII stemmer configuration

2020-06-16 Thread Mark Dilger
> On Jun 16, 2020, at 7:37 AM, Tom Lane wrote:
>
> I wrote:
>> Peter Eisentraut writes:
>>> Moreover, AFAIK, the following other languages do not use Latin-based
>>> alphabets:
>
>>> arabic arabic \
>>> greek greek \
>>> nepali nepali \
>>> tamil tamil

Re: snowball ASCII stemmer configuration

2020-06-16 Thread Tom Lane
Mark Dilger writes:
> I am a bit surprised to see that you are right about this, because non-Latin
> languages often have transliteration/romanization schemes for writing the
> language in the Latin alphabet, developed before computers had widespread
> adoption of non-ASCII character sets, and

Re: snowball ASCII stemmer configuration

2020-06-19 Thread Peter Eisentraut
On 2020-06-16 16:37, Tom Lane wrote: After further reflection, I think these are indeed mistakes and we should change them all. The argument for the Russian/English case, AIUI, is "if we come across an all-ASCII word, it is most certainly not Russian, and the most likely Latin-based language is

Re: snowball ASCII stemmer configuration

2020-06-19 Thread Tom Lane
Peter Eisentraut writes:
> Do we *have* to have an ASCII stemmer that corresponds to an actual
> language? Couldn't we use the simple stemmer or no stemmer at all?
> In my experience, ASCII text in, say, Russian or Greek will typically be
> acronyms or brand names or the like, and there doesn't