[Rails I18n] yet another unicode string hacks

Dae San Hwang Wed, 14 Jun 2006 03:31:58 -0700

Hi everyone.

I'm implementing yet another Unicode string hack. I'm trying to
rewire the String class so that it acts like the Ruby 2.0 String class
(see http://redhanded.hobix.com/inspect/futurismUnicodeInRuby.html).


String literals will act as byte buffers, just as they always have.
However, when you create a string object with the constructor, you can
optionally specify the encoding of the input string.

  String.new("\352\260\200", "utf-8")

The default encoding is nil if $KCODE is not set or is set to
"none". The default encoding is 'utf-8' if $KCODE == 'u'. If the
encoding is nil, string objects will act just like the old Ruby
strings we all know and love. If the encoding is set to a specific
charset, the string's instance methods will behave according to that
encoding. The following is a summary of what I'm thinking:

  String#encoding gives the character encoding name (e.g. "utf-8").
  String#[index] returns a one-character string if the encoding is
    set. If the encoding is not set, it returns a Fixnum, as it used
    to.
  String#[] is always encoding aware if the encoding is set.
  String#slice is always a byte buffer operation, regardless of the
    encoding.
  String#size always returns the number of bytes in the string.
  String#length returns the number of characters in the string
    according to the specified encoding. If the encoding is not set,
    it is the same as String#size.
  String#+ will return a utf-8 encoded string if the two strings'
    encodings do not match.

  *, <<, <=>, ==, =~, capitalize, casecmp, center, chomp, chop,
  count, delete, downcase, each, each_line, eql?, gsub, match, succ,
  scan, split, strip, sub, upcase and upto will all be encoding aware
  if the encoding is set.
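
To show what "encoding aware" changes in practice, here is a small
sketch of chop and reverse on the same bytes, once as a plain byte
buffer and once as utf-8. It is only an illustration: it leans on the
encoding support built into current Ruby (String#b, force_encoding) as
a stand-in for the proposed encoding attribute, which is an assumption
on my part and not how the hack itself would be written for Ruby 1.8.

```ruby
# Assumption: current Ruby's built-in encodings stand in for the
# proposed per-string encoding. "\352\260\200" is one Korean character
# encoded as three utf-8 bytes.
raw  = "a\352\260\200".b                  # no encoding: plain bytes
utf8 = raw.dup.force_encoding("UTF-8")    # same bytes, encoding set

raw_chopped  = raw.chop     # removes one byte, leaving a broken sequence
utf8_chopped = utf8.chop    # removes the whole three-byte character

puts raw_chopped.bytesize   # 3
puts utf8_chopped.bytesize  # 1

# reverse shows the same difference: bytes versus characters.
p raw.reverse.bytes         # [128, 176, 234, 97]
p utf8.reverse.bytes        # [234, 176, 128, 97]
```

The byte-wise chop leaves a truncated multibyte sequence behind, which
is the kind of damage the encoding-aware versions are meant to avoid.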

The reason I'm differentiating between 'size' and 'length' is that
some libraries (like Rails) depend on them returning the byte size of
the string. Maybe we can establish a convention that 'size' means the
byte size and 'length' means the number of characters. The same
reasoning goes for '[]' and 'slice'.
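
The size/length and []/slice split can be sketched roughly as follows.
This is a standalone wrapper and not the actual patch to String: the
class name EncodedString, the two-argument slice and the private chars
helper are all hypothetical, and the code relies on current Ruby
methods (String#b, getbyte, byteslice) that are not available in 1.8.

```ruby
# Hypothetical wrapper, for illustration only. The proposal rewires
# String itself; EncodedString and its internals are assumed names.
class EncodedString
  attr_reader :encoding

  def initialize(bytes, encoding = nil)
    @bytes    = bytes.b     # keep the raw bytes, tagged as binary
    @encoding = encoding    # nil means "plain byte buffer"
  end

  # size: always the number of bytes.
  def size
    @bytes.bytesize
  end

  # length: number of characters if an encoding is set, else same as size.
  def length
    @encoding ? chars.length : size
  end

  # []: a one-character string if an encoding is set, else the byte value.
  def [](index)
    @encoding ? chars[index] : @bytes.getbyte(index)
  end

  # slice: always a byte-level operation, whatever the encoding.
  def slice(start, count)
    @bytes.byteslice(start, count)
  end

  private

  # Decode the bytes into an array of single-character strings.
  def chars
    @bytes.dup.force_encoding(@encoding).chars
  end
end

s = EncodedString.new("\352\260\200abc", "utf-8")
puts s.size             # 6
puts s.length           # 4
puts s[1]               # a
p s.slice(0, 3).bytes   # [234, 176, 128]
```

Code that needs byte counts (for example when computing a content
length) would keep calling size and slice, while code that deals with
text would use length and [].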

For now, it will support only the utf-8 encoding, as Ruby's regexp
doesn't seem to support encodings other than ASCII and utf-8. (I could
use iconv to convert the encoding internally to utf-8 for each method
call, but at the moment, I think it's probably too costly and not
worth it.)
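
To make the cost concrete, the convert-on-every-call idea would look
roughly like the sketch below. The helper name via_utf8 is
hypothetical, and String#encode from current Ruby is used here as a
stand-in for iconv, which is an assumption and not what a Ruby 1.8
implementation would call.

```ruby
# Hypothetical helper: run one string operation in utf-8 and convert the
# result back. String#encode stands in for iconv (an assumption).
def via_utf8(bytes, encoding)
  utf8   = bytes.dup.force_encoding(encoding).encode("UTF-8")  # conversion 1
  result = yield(utf8)                                         # the real work
  result.encode(encoding)                                      # conversion 2
end

# "\xB0\xA1" is the EUC-KR encoding of the character used earlier.
euc_kr   = "\xB0\xA1a".b
reversed = via_utf8(euc_kr, "EUC-KR") { |s| s.reverse }

p reversed.bytes           # [97, 176, 161]
puts reversed.encoding     # EUC-KR
```

Every call pays for two conversions of the whole string on top of the
operation itself, which is why supporting only utf-8 looks like the
better trade for now.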

I would love to get some feedback on this. I really want to create
something that I can depend on until Ruby 2.0 is released.

Thanks!

Daesan


Dae San Hwang
[EMAIL PROTECTED]



