Re: urlify.js blocks out non-English chars - 2nd try?

Gábor Farkas Wed, 19 Jul 2006 02:29:41 -0700

Antonio Cavedoni wrote:
> On 17 Jul 2006, at 8:25, tsuyuki makoto wrote:
>> We Japanese know that we can't translate Japanese to ASCII.
>> So I want to do it as follows at least.
>> A letter does not disappear and is restored.
>> #FileField and ImageField have same letters disappear problem.
>>
>> def slug_ja(word) :
>>     try :
>>         unicode(word, 'ASCII')
>>         import re
>>         slug = re.sub('[^\w\s-]', '', word).strip().lower()
>>         slug = re.sub('[-\s]+', '-', slug)
>>         return slug
>>     except UnicodeDecodeError :
>>         from encodings import idna
>>         painful_slug = word.strip().lower().decode('utf-8').encode('IDNA')
>>         return painful_slug
> 
> I’m not convinced by this approach, but I would suggest using the  
> “punycode” instead of the “idna” encoder anyway. The results don’t  
> include the initial “xn--” marks which are only useful in a domain  
> name, not in a URI path. Also, the “from encodings […]” line appears  
> to be unnecessary on my Python 2.3.5 and 2.4.1 on OSX.
> 
> [[[
>  >>> p = u"perché"
>  >>> from encodings import idna
>  >>> p.encode('idna')
> 'xn--perch-fsa'
>  >>> p.encode('punycode')
> 'perch-fsa'
>  >>> puny = 'perch-fsa'
>  >>> puny.decode('punycode')
> u'perch\xe9'
>  >>> print puny.decode('punycode')
> perché
>  >>> pu = puny.decode('punycode') # it's reversible
>  >>> print pu
> perché
> ]]]
> 
> More on Punycode: http://en.wikipedia.org/wiki/Punycode



i somehow have the feeling that we lost the original idea here a little.

(as far as i understand, by urlify.js we are talking about slug 
auto-generation, please correct me if i'm wrong).

we are auto-generating slugs when it "makes sense". for example, for 
english it makes sense to remove all the non-word stuff, because what 
remains can still be read, be understood, and generally looks fine as 
part of the URL.

for many languages (hungarian or the slavic ones), it also "makes 
sense" to simply drop all the diacritical marks, because the rest can 
still be read, be understood, and looks fine as part of a URL.
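
a minimal sketch of the diacritic-dropping idea, assuming python 3 and its standard unicodedata module. the name strip_diacritics is made up for illustration, and this is not what urlify.js itself does:

```python
import unicodedata

def strip_diacritics(text):
    # hypothetical helper: decompose each character into its base letter
    # plus combining marks (NFKD), then drop the combining marks
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(strip_diacritics("Gábor árvíztűrő"))
```

note that this only handles letters that unicode decomposes into base + mark. a letter like the polish "ł" has no such decomposition and passes through unchanged, so it would still need an explicit mapping table.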

but with punycode or whatever-code encoding of japanese, what's the point?
what you get will be completely unreadable. if you only need to 
preserve the submitted data, you don't need to do anything special. simply 
take your unicode text, encode it to utf8, url-escape it and use it as a 
part of the url. it will be ok. and on the other side you can url-unescape 
and utf8-decode it and you're back. the ascii parts will even stay 
readable.
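
the round trip described above can be sketched like this, assuming python 3 and urllib.parse from the standard library. to_url_part and from_url_part are hypothetical names, used only for illustration:

```python
from urllib.parse import quote, unquote

def to_url_part(text):
    # utf-8 encode, then percent-escape every byte outside the
    # unreserved set (ascii letters, digits, "-", ".", "_", "~")
    return quote(text.encode("utf-8"), safe="")

def from_url_part(part):
    # undo the percent-escaping and decode the bytes as utf-8
    return unquote(part, encoding="utf-8")

for original in ["perché", "日本語", "hello-world"]:
    escaped = to_url_part(original)
    print(escaped, from_url_part(escaped) == original)
```

plain ascii input such as "hello-world" comes out unchanged, which is the "readable" part. everything else becomes %XX escapes, but nothing is lost.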

from my point of view, with the current slug approach, you either can 
convert your text into ascii that "makes sense" or you cannot. if you 
can, then enhancing urlify.js makes sense. if you cannot, then it makes 
no sense. imho.


gabor

--~--~---------~--~----~------------~-------~--~----~
You received this message because you are subscribed to the Google Groups 
"Django developers" group.
To post to this group, send email to django-developers@googlegroups.com
To unsubscribe from this group, send email to [EMAIL PROTECTED]
For more options, visit this group at 
http://groups.google.com/group/django-developers
-~----------~----~----~----~------~----~------~--~---
