Re: chemistry character set

2018-03-15 Thread Mike Dewhirst

On 16/03/2018 12:40 AM, Peter of the Norse wrote:
> I ran into a similar problem with one of my projects; people were 
> using Greek and Cyrillic letters and other symbols to be cute.  It’s 
> all in English, but they kept doing things like using ß for B and ¥ 
> for Y.  And then expecting to be able to search the way it 
> looks.  So I am doing the cleanup in the .save() method.  My only 
> advice is to use 
> https://docs.python.org/3/library/stdtypes.html?highlight=translate#str.translate instead 
> of multiple replaces.  If you make the translation map a global 
> variable, it is much faster.


Wow!

Thank you. Ain't Python marvellous!

Mike



> - Peter of the Norse
>
>> On Feb 15, 2018, at 5:55 AM, Mike Dewhirst wrote:
>>
>>> On 15/02/2018 10:19 PM, Hanne Moa wrote:
>>>> On 2018-02-06 12:51, Mike Dewhirst wrote:
>>>> Thank you. I think this is where we probably need to go. I asked 
>>>> the original question because I'm hoping the project will reach a 
>>>> tipping point and start to accumulate a growing number of 
>>>> multilingual users. We have our first multinational user but they 
>>>> only operate in the English speaking world so no pressure at the 
>>>> moment.
>>> There can be no sort that satisfies every possible language at the 
>>> same time. For instance, Norwegian sorts "ä" as "a" and "ö" as "o". 
>>> Swedish sorts them after "å" as separate letters: åäö. Then there is 
>>> Turkish where "i" sorts differently from "ı" (dotless i).
>>
>> That is interesting! It says to me that longer term I need to think 
>> about special sort orders for different languages. A bit above my pay 
>> grade just now.
>>
>> I've worked the greek letter prefixes by using a separate sort field 
>> only seen by the software. A simple replace('α', 'a') lets me adjust 
>> sort order for the moment. That may work with diacritics for some 
>> time. I'll be driven by actual requirements until I hit a brick wall 
>> and then I'll ask for PhD help :)
>>
>> Thanks
>>
>> Mike
>>
>>> I'm guessing chemistry names follow their own rules, you could see 
>>> how hard it is to make your own os collation table and use that? 
>>> Then everything running on the server would sort by the same rules.
>>>
>>> HM

--
You received this message because you are subscribed to the Google 
Groups "Django users" group.
To unsubscribe from this group and stop receiving emails from it, 
send an email to django-users+unsubscr...@googlegroups.com.
To post to this group, send email to django-users@googlegroups.com.
Visit this group at https://groups.google.com/group/django-users.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/django-users/bfd1633b-a651-fec9-6f2a-86efac8d2e8c%40dewhirst.com.au.
For more options, visit https://groups.google.com/d/optout.



Re: chemistry character set

2018-03-15 Thread Peter of the Norse
I ran into a similar problem with one of my projects; people were using Greek 
and Cyrillic letters and other symbols to be cute.  It’s all in English, but 
they kept doing things like using ß for B and ¥ for Y.  And then expecting to 
be able to search the way it looks.  So I am doing the cleanup in the 
.save() method.  My only advice is to use 
https://docs.python.org/3/library/stdtypes.html?highlight=translate#str.translate
 instead of multiple replaces.  If you make the translation map a global 
variable, it is much faster. 
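
As a rough sketch of the str.translate suggestion: the table below is built once at module level with str.maketrans and reused on every call. The particular characters in the map and the function names are only illustrative assumptions, not the mapping used in the project described above.

```python
# Built once at import time. str.maketrans accepts a dict whose keys are
# single characters and whose values are the replacement strings.
LOOKALIKES = str.maketrans({
    "ß": "B",
    "¥": "Y",
    "α": "a",
    "β": "b",
})


def clean(text):
    """Replace look-alike symbols in a single pass over the string."""
    return text.translate(LOOKALIKES)


def clean_with_replace(text):
    """Same result with chained replace(): one pass per mapped character."""
    for src, dst in (("ß", "B"), ("¥", "Y"), ("α", "a"), ("β", "b")):
        text = text.replace(src, dst)
    return text


print(clean("ßig ¥ak"))  # Big Yak
```

Both functions return the same string; the translate version scans the text once regardless of how many characters are in the map.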

- Peter of the Norse

> On Feb 15, 2018, at 5:55 AM, Mike Dewhirst wrote:
> 
>> On 15/02/2018 10:19 PM, Hanne Moa wrote:
>>> On 2018-02-06 12:51, Mike Dewhirst wrote:
>>> Thank you. I think this is where we probably need to go. I asked the 
>>> original question because I'm hoping the project will reach a tipping point 
>>> and start to accumulate a growing number of multilingual users. We have our 
>>> first multinational user but they only operate in the English speaking 
>>> world so no pressure at the moment.
>> There can be no sort that satisfies every possible language at the same 
>> time. For instance, Norwegian sorts "ä" as "a" and "ö" as "o". Swedish sorts 
>> them after "å" as separate letters: åäö. Then there is Turkish where "i" 
>> sorts differently from "ı" (dotless i).
> 
> That is interesting! It says to me that longer term I need to think about 
> special sort orders for different languages. A bit above my pay grade just 
> now.
> 
> I've worked the greek letter prefixes by using a separate sort field only 
> seen by the software. A simple replace('α', 'a') lets me adjust sort order 
> for the moment. That may work with diacritics for some time. I'll be driven 
> by actual requirements until I hit a brick wall and then I'll ask for PhD 
> help :)
> 
> Thanks
> 
> Mike
> 
>> I'm guessing chemistry names follow their own rules, you could see how hard 
>> it is to make your own os collation table and use that? Then everything 
>> running on the server would sort by the same rules.
>> HM
>> 
> 


Re: chemistry character set

2018-02-15 Thread Mike Dewhirst

On 15/02/2018 10:19 PM, Hanne Moa wrote:
> On 2018-02-06 12:51, Mike Dewhirst wrote:
>> Thank you. I think this is where we probably need to go. I asked the original 
>> question because I'm hoping the project will reach a tipping point and start to 
>> accumulate a growing number of multilingual users. We have our first 
>> multinational user but they only operate in the English speaking world so no 
>> pressure at the moment.
>
> There can be no sort that satisfies every possible language at the same time. 
> For instance, Norwegian sorts "ä" as "a" and "ö" as "o". Swedish sorts them 
> after "å" as separate letters: åäö. Then there is Turkish where "i" sorts 
> differently from "ı" (dotless i).


That is interesting! It says to me that longer term I need to think 
about special sort orders for different languages. A bit above my pay 
grade just now.


I've worked the greek letter prefixes by using a separate sort field 
only seen by the software. A simple replace('α', 'a') lets me adjust 
sort order for the moment. That may work with diacritics for some time. 
I'll be driven by actual requirements until I hit a brick wall and then 
I'll ask for PhD help :)
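
For the diacritics case, one possible extension of the same hidden-sort-field idea is to fold accented letters to their base letters with the standard unicodedata module. This is only a sketch and the function name is hypothetical; note that the folding is language-blind, so it gives the "ä sorts as a" behaviour and would be wrong for Swedish, and it leaves characters with no decomposition (such as the Turkish dotless ı or Greek letters) untouched.

```python
import unicodedata


def strip_diacritics(text):
    """Drop combining marks after canonical/compatibility decomposition."""
    # NFKD splits "ä" into "a" followed by a combining diaeresis (U+0308).
    decomposed = unicodedata.normalize("NFKD", text)
    # unicodedata.combining() is non-zero for combining marks, which we skip.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(strip_diacritics("åäö"))  # aao
```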


Thanks

Mike


> I'm guessing chemistry names follow their own rules, you could see how hard it 
> is to make your own os collation table and use that? Then everything running on 
> the server would sort by the same rules.
> HM





Re: chemistry character set

2018-02-15 Thread Hanne Moa
On 2018-02-06 12:51, Mike Dewhirst wrote:
> Thank you. I think this is where we probably need to go. I asked the
> original question because I'm hoping the project will reach a tipping
> point and start to accumulate a growing number of multilingual users. We
> have our first multinational user but they only operate in the English
> speaking world so no pressure at the moment.

There can be no sort that satisfies every possible language at the same
time. For instance, Norwegian sorts "ä" as "a" and "ö" as "o". Swedish
sorts them after "å" as separate letters: åäö. Then there is Turkish
where "i" sorts differently from "ı" (dotless i).

I'm guessing chemistry names follow their own rules, you could see how
hard it is to make your own os collation table and use that? Then
everything running on the server would sort by the same rules.
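
A real OS or database collation table is outside what can be shown here, but the idea of a custom ordering can be sketched at the Python level with a sort key. The alphabet string below (Swedish order, with å, ä, ö after z) and the names SWEDISH_ALPHABET and collation_key are assumptions for illustration only.

```python
# Hypothetical alphabet for illustration: Swedish order, å ä ö after z.
SWEDISH_ALPHABET = "abcdefghijklmnopqrstuvwxyzåäö"
RANK = {ch: i for i, ch in enumerate(SWEDISH_ALPHABET)}


def collation_key(word):
    """Rank each character by its position in the custom alphabet."""
    # Characters missing from the table sort after every known letter,
    # ordered among themselves by code point.
    return [(RANK.get(ch, len(RANK)), ch) for ch in word.casefold()]


words = ["öl", "zebra", "äpple", "ål", "apa"]
print(sorted(words))                     # code point order: ä before å
print(sorted(words, key=collation_key))  # custom order: å, ä, ö
```

Swapping in a different alphabet string changes the order without touching the data, which is the property a shared collation table would give every program on the server.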


HM



Re: chemistry character set

2018-02-06 Thread Mike Dewhirst

On 6/02/2018 10:08 PM, Jason wrote:
> At first glance, I thought this was an easy problem to have, but 
> apparently it is certainly not!  I came across an Oracle whitepaper that 
> describes how to sort your linguistic data, and you might find some 
> clues there to adapt to your current db. 
> http://ilmarkerm.blogspot.com/2009/07/using-linguistic-indexes-for-sorting-in.html is 
> an old post describing linguistic indexes in postgres and mysql, but 
> the dbs used are almost 8 years out of date, so you might have to 
> update the syntax to your current version.


Thank you. I think this is where we probably need to go. I asked the 
original question because I'm hoping the project will reach a tipping 
point and start to accumulate a growing number of multilingual users. We 
have our first multinational user but they only operate in the English 
speaking world so no pressure at the moment.


I really appreciate that pointer

Cheers

Mike



> On Monday, February 5, 2018 at 6:56:00 PM UTC-5, Mike Dewhirst wrote:
>
>> Chemical names start with both upper and lower case as well as Greek
>> characters. Chemical names also exist in multiple non-western non-latin
>> languages.
>>
>> To get lists of chemicals sorting more or less "correctly" I currently
>> slugify with allow_unicode=True.
>>
>> This for example gets tert-Butyl... sorted nicely among names starting
>> with upper-case T.
>>
>> Unfortunately the α-terpineol or beta this or ε that all sink to the
>> end of the list instead of sorting into the A, B or Es.
>>
>> My google-fu indicates I can sort on a property but that is slow. I have
>> thought about tweaking slugify to include a table of equivalences
>> between Greek and Western chars but that doesn't necessarily cater for
>> non-Western character sets. Maybe an ever expanding table of
>> equivalences?
>>
>> Thanks for any ideas ...
>>
>> Mike



Re: chemistry character set

2018-02-06 Thread Mike Dewhirst

On 6/02/2018 10:27 PM, Julio Biason wrote:

> Hi Mike,
>
> One thing that occurs to me is that you can override the model save() to 
> update another field -- one that the user doesn't have access to. In that 
> function, you will write a new field, say `sortable_name`, in which 
> you'll transform the chemical name into something that will appear in 
> the proper order, like converting alphas to A, betas to B, etc.


Agreed. This is what I have done to date ...

In substance.save() ...

self.slug = greek_tweak(self.name, allow_unicode=True)

substance.slug is not displayed anywhere and nor is it used in urls 
because there can be many substances with the same name. And 
greek_tweak() ...


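from django.utils.text import slugify

def greek_tweak(name, allow_unicode=True):
    # Swap Greek letters for Latin stand-ins so they sort with A, B, G, D, E
    name = name.replace('α', 'a').replace('β', 'b').replace('γ', 'g')
    name = name.replace('δ', 'd').replace('ε', 'e')
    return slugify(name, allow_unicode=allow_unicode)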

And back in substance Meta ...

ordering = ['slug']




> When you request the list of chemicals by name order, you actually use 
> the `sortable_name` field, which will have all the conversions in place.
>
> On Mon, Feb 5, 2018 at 9:55 PM, Mike Dewhirst wrote:
>
>> Chemical names start with both upper and lower case as well as
>> Greek characters. Chemical names also exist in multiple
>> non-western non-latin languages.
>>
>> To get lists of chemicals sorting more or less "correctly" I
>> currently slugify with allow_unicode=True.
>>
>> This for example gets tert-Butyl... sorted nicely among names
>> starting with upper-case T.
>>
>> Unfortunately the α-terpineol or beta this or ε that all sink to
>> the end of the list instead of sorting into the A, B or Es.
>>
>> My google-fu indicates I can sort on a property but that is slow.
>> I have thought about tweaking slugify to include a table of
>> equivalences between Greek and Western chars but that doesn't
>> necessarily cater for non-Western character sets. Maybe an ever
>> expanding table of equivalences?
>>
>> Thanks for any ideas ...
>>
>> Mike








Re: chemistry character set

2018-02-06 Thread Julio Biason
Hi Mike,

One thing that occurs to me is that you can override the model save() to
update another field -- one that the user doesn't have access to. In that
function, you will write a new field, say `sortable_name`, in which you'll
transform the chemical name into something that will appear in the proper
order, like converting alphas to A, betas to B, etc.

When you request the list of chemicals by name order, you actually use the
`sortable_name` field, which will have all the conversions in place.
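
A minimal sketch of that idea follows. To keep it runnable without a database, the Substance class here is a plain-Python stand-in for a Django model, and make_sortable and the Greek-to-Latin table are hypothetical helpers; in a real model the same assignment would sit in save() before the call to super().save(), with ordering done on `sortable_name`.

```python
from dataclasses import dataclass

# Illustrative table only; extend it as new prefixes turn up.
GREEK_TO_LATIN = str.maketrans({"α": "a", "β": "b", "γ": "g", "δ": "d", "ε": "e"})


def make_sortable(name):
    """Derive the hidden sort value from the display name."""
    return name.translate(GREEK_TO_LATIN).casefold()


@dataclass
class Substance:
    """Plain-Python stand-in for a Django model (no database involved)."""
    name: str
    sortable_name: str = ""

    def save(self):
        # A real model would do this, then call super().save(*args, **kwargs).
        self.sortable_name = make_sortable(self.name)


substances = [Substance("tert-Butyl alcohol"), Substance("α-Terpineol"), Substance("Benzene")]
for s in substances:
    s.save()

ordered = sorted(substances, key=lambda s: s.sortable_name)
print([s.name for s in ordered])
```

The display name is never altered; only the derived field changes, so the conversion table can be revised later and the field rebuilt by re-saving.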

On Mon, Feb 5, 2018 at 9:55 PM, Mike Dewhirst wrote:

> Chemical names start with both upper and lower case as well as Greek
> characters. Chemical names also exist in multiple non-western non-latin
> languages.
>
> To get lists of chemicals sorting more or less "correctly" I currently
> slugify with allow_unicode=True.
>
> This for example gets tert-Butyl... sorted nicely among names starting
> with upper-case T.
>
> Unfortunately the α-terpineol or beta this or  ε that all sink to the end
> of the list instead of sorting into the A, B or Es.
>
> My google-fu indicates I can sort on a property but that is slow. I have
> thought about tweaking slugify to include a table of equivalences between
> Greek and Western chars but that doesn't necessarily cater for non-Western
> character sets. Maybe an ever expanding table of equivalences?
>
> Thanks for any ideas ...
>
> Mike
>



-- 
*Julio Biason*, Software Engineer
*AZION*  |  Deliver. Accelerate. Protect.
Office: +55 51 3083 8101   |  Mobile: +55 51
*99907 0554*



Re: chemistry character set

2018-02-06 Thread Jason
At first glance, I thought this was an easy problem to have, but apparently 
it is certainly not!  I came across an Oracle whitepaper that 
describes how to sort your linguistic data, and you might find some clues 
there to adapt to your current db.  
http://ilmarkerm.blogspot.com/2009/07/using-linguistic-indexes-for-sorting-in.html 
is an old post describing linguistic indexes in postgres and mysql, but the 
dbs used are almost 8 years out of date, so you might have to update the 
syntax to your current version.

On Monday, February 5, 2018 at 6:56:00 PM UTC-5, Mike Dewhirst wrote:
>
> Chemical names start with both upper and lower case as well as Greek 
> characters. Chemical names also exist in multiple non-western non-latin 
> languages. 
>
> To get lists of chemicals sorting more or less "correctly" I currently 
> slugify with allow_unicode=True. 
>
> This for example gets tert-Butyl... sorted nicely among names starting 
> with upper-case T. 
>
> Unfortunately the α-terpineol or beta this or  ε that all sink to the 
> end of the list instead of sorting into the A, B or Es. 
>
> My google-fu indicates I can sort on a property but that is slow. I have 
> thought about tweaking slugify to include a table of equivalences 
> between Greek and Western chars but that doesn't necessarily cater for 
> non-Western character sets. Maybe an ever expanding table of equivalences? 
>
> Thanks for any ideas ... 
>
> Mike 
>
>



chemistry character set

2018-02-05 Thread Mike Dewhirst
Chemical names start with both upper and lower case as well as Greek 
characters. Chemical names also exist in multiple non-western non-latin 
languages.


To get lists of chemicals sorting more or less "correctly" I currently 
slugify with allow_unicode=True.


This for example gets tert-Butyl... sorted nicely among names starting 
with upper-case T.


Unfortunately the α-terpineol or beta this or  ε that all sink to the 
end of the list instead of sorting into the A, B or Es.
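
The symptom can be reproduced without Django. In this small sketch the sample names are made up for illustration, and str.casefold is only an assumed stand-in for the lower-casing that slugify performs. A plain string sort compares code points, and the Greek letters have much higher code points than any ASCII letter.

```python
names = ["tert-Butyl alcohol", "α-terpineol", "Zinc oxide", "ε-caprolactone", "benzene"]

# Case-insensitive sort: tert-Butyl lands among the Ts, but the Greek
# prefixes still come last because of their code points.
by_lowercase = sorted(names, key=str.casefold)
print(by_lowercase)

# Code points explain the order: 'z' is 122, 'α' is 945, 'ε' is 949.
print(ord("z"), ord("α"), ord("ε"))
```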


My google-fu indicates I can sort on a property but that is slow. I have 
thought about tweaking slugify to include a table of equivalences 
between Greek and Western chars but that doesn't necessarily cater for 
non-Western character sets. Maybe an ever expanding table of equivalences?
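
One way to avoid maintaining the Greek part of such a table by hand is to derive the Latin stand-in from each character's official Unicode name. This is a heuristic sketch only, with hypothetical function names: it handles names like "GREEK SMALL LETTER ALPHA", gives arguable results for some letters (theta becomes "t", chi becomes "c"), would need special-casing for names such as "FINAL SIGMA", and does nothing for other scripts.

```python
import unicodedata


def latin_initial(ch):
    """Guess a Latin stand-in from the character's Unicode name (heuristic)."""
    name = unicodedata.name(ch, "")
    if name.startswith("GREEK") and " LETTER " in name:
        # "GREEK SMALL LETTER ALPHA" -> "ALPHA" -> "a"
        letter_name = name.split(" LETTER ", 1)[1].split()[0]
        return letter_name[0].lower()
    return ch


def sort_key(text):
    """Build a sort value with Greek letters replaced by their initials."""
    return "".join(latin_initial(ch) for ch in text).casefold()


print(sort_key("α-terpineol"))  # a-terpineol
```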


Thanks for any ideas ...

Mike
