[PHP-DEV] PHP Unicode strings impl proposal

Rolland Santimano Tue, 23 Aug 2005 08:23:45 -0700

Comments / impl suggestions please.

TIA,
Rolland
--


[1] string substr_replace(string original, string new, int start[, int
length])
Returns string where original[start..length] is replaced with
new. Input args can be arrays, in which case the operation is:
substr_replace(original[i], new[i], start[i], length[i])
Impl:
The current impl is written in terms of memcpy(), after adjusting
start & length correctly. With Unicode input, 'start' & 'length' may
not be aligned with codepoint/grapheme boundaries. If args are mixed
string types, convert to common type.
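
To illustrate the boundary problem, here is a rough sketch of snapping a
UChar offset back to the start of a codepoint before any memcpy(). It
assumes UTF-16 storage in plain uint16_t buffers; the helper names are
made up for this example and are not PHP or ICU APIs (ICU's
U16_SET_CP_START macro does the same job). Grapheme boundaries are not
handled here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helpers: classify a UTF-16 code unit as a surrogate half. */
static int is_lead_unit(uint16_t u)  { return (u & 0xFC00) == 0xD800; }
static int is_trail_unit(uint16_t u) { return (u & 0xFC00) == 0xDC00; }

/*
 * If 'offset' points at the trail half of a surrogate pair, move it back
 * one unit so it points at the lead half. Otherwise return it unchanged.
 */
size_t snap_to_codepoint_start(const uint16_t *s, size_t len, size_t offset)
{
    assert(s != NULL && offset <= len);
    if (offset > 0 && offset < len &&
        is_trail_unit(s[offset]) && is_lead_unit(s[offset - 1])) {
        return offset - 1;
    }
    return offset;
}
```

For the buffer { 'a', 0xD835, 0xDC00, 'b' }, an offset of 2 lands inside
the pair and would be moved back to 1; offsets 1 and 3 are already on
boundaries and stay put.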


[2] int substr_count(string text, string token[, int start[, int
length]])
Returns the number of occurrences of token in text[start..length]
Impl:
The current impl is around php_memnstr() and can be extended for
Unicode with zend_u_memnstr()
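
A minimal sketch of what the counting loop could look like over UChar
buffers, assuming UTF-16 code units in uint16_t arrays. The function
name u16_substr_count is hypothetical, and a plain memcmp() scan stands
in for the search primitive; it is not meant to show how
zend_u_memnstr() itself works. Matches are counted without overlap, and
canonically equivalent sequences are not treated as equal.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Count non-overlapping occurrences of 'token' in 'text' (lengths in units). */
size_t u16_substr_count(const uint16_t *text, size_t text_len,
                        const uint16_t *token, size_t token_len)
{
    size_t count = 0;
    size_t i = 0;

    assert(text != NULL && token != NULL);
    if (token_len == 0 || token_len > text_len) {
        return 0;
    }
    while (i + token_len <= text_len) {
        if (memcmp(text + i, token, token_len * sizeof(uint16_t)) == 0) {
            count++;
            i += token_len;   /* skip past the match: no overlapping hits */
        } else {
            i++;
        }
    }
    return count;
}
```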


[3] string strtok([string text, ]string separator)
Tokenize string
Impl:
Current impl uses global state, in the form of char ptrs and a
256-char array. Mixed string type input would be converted to common
type, and new global state would have to include initial type of
separator. Tokenizing should honor base+combining sequences.


[4] string strrev(string text)
Returns reversed string equivalent of input.
Impl:
The current impl walks the input string in reverse and copies it one
character at a time. This can be achieved using the U16_NEXT/U16_PREV
macros. Combining characters can be copied together using the
u_getCombiningClass() API.
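
As a sketch of the reverse walk, the following copies a UTF-16 buffer
back to front while keeping each surrogate pair in lead/trail order. It
hand-rolls the surrogate checks instead of using the ICU macros so it
compiles on its own; the names are invented for this example. Keeping
base+combining sequences together is left out, since that needs
character property data such as u_getCombiningClass().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helpers: classify a UTF-16 code unit as a surrogate half. */
static int rev_is_lead(uint16_t u)  { return (u & 0xFC00) == 0xD800; }
static int rev_is_trail(uint16_t u) { return (u & 0xFC00) == 0xDC00; }

/*
 * Write the codepoints of src[0..len) into dst in reverse order.
 * dst must have room for len units and must not overlap src.
 */
void u16_reverse_codepoints(const uint16_t *src, size_t len, uint16_t *dst)
{
    size_t i = len;   /* read position, moving backwards */
    size_t o = 0;     /* write position, moving forwards */

    assert(src != NULL && dst != NULL);
    while (i > 0) {
        if (i >= 2 && rev_is_trail(src[i - 1]) && rev_is_lead(src[i - 2])) {
            /* surrogate pair: copy both halves in their original order */
            dst[o++] = src[i - 2];
            dst[o++] = src[i - 1];
            i -= 2;
        } else {
            dst[o++] = src[i - 1];
            i--;
        }
    }
}
```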


[5] string str_pad(string text, int length[, string pad[, int
pad_type]])
Returns input string padded on the left and/or right (determined by
pad_type) to specified length with pad string.
Impl:
The impl builds the output string by copying appropriate pad
characters to the left and/or right of the input string.

Q: With STR_PAD_BOTH, let's say 'length' == input 'text' length + 2
(lengths in UChars), but the 'pad' text is non-BMP (i.e. 2 UChars); then
the 'pad' text can't be added at either end. More generally, the 'pad'
text can't be split in the middle of non-BMP codepoints or
base+combining sequences. If such a condition occurs, an error should
be returned. Any other thoughts?
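
One way the error condition could be detected, sketched under the
assumption that each side is filled by repeating 'pad' and truncating
to n UChars: check whether the truncation point falls between the two
halves of a surrogate pair. pad_fill_is_clean is a hypothetical name,
and the check only covers non-BMP codepoints, not base+combining
sequences.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Returns 1 if filling n units by repeating pad[0..pad_len) ends on a
 * codepoint boundary, 0 if the last copy would split a surrogate pair.
 */
int pad_fill_is_clean(const uint16_t *pad, size_t pad_len, size_t n)
{
    size_t r;

    assert(pad != NULL && pad_len > 0);
    r = n % pad_len;              /* units taken from the final, partial copy */
    if (r == 0) {
        return 1;                 /* only whole copies of pad are used */
    }
    /* the cut sits between pad[r-1] and pad[r] */
    return !(((pad[r - 1] & 0xFC00) == 0xD800) &&
             ((pad[r] & 0xFC00) == 0xDC00));
}
```

In the STR_PAD_BOTH case above, each side gets 1 UChar of a 2-UChar
pad, so the check fails on both sides and the caller could return the
error.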


[6] int similar_text(string str1, string str2[, int percentage])
Returns the number of common characters between str1 & str2.
Impl:
The current impl determines common characters by comparing characters
to generate common sequences. Comparisons for Unicode strings should
be done with codepoints.


[7] int levenshtein(string str1, string str2[, int ins_cost, int
rep_cost, int del_cost])
Calculate Levenshtein distance between str1 & str2.

Q: Any gotchas in extending the Levenshtein algorithm for Unicode? Should
the ins/rep/del cost be expressed in graphemes or codepoints?
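
For comparison purposes, here is a sketch of the codepoint option: the
input is decoded from UTF-16 to codepoints first, then the usual
two-row dynamic programming table is run over them. The names
lev_decode and u16_levenshtein are hypothetical, and nothing here
settles the grapheme question; it only shows that a non-BMP
substitution costs one 'rep' rather than two.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Decode UTF-16 into codepoints; unpaired surrogates pass through as-is. */
static size_t lev_decode(const uint16_t *s, size_t len, uint32_t *out)
{
    size_t i = 0, n = 0;
    while (i < len) {
        uint32_t c = s[i++];
        if ((c & 0xFC00) == 0xD800 && i < len && (s[i] & 0xFC00) == 0xDC00) {
            c = 0x10000 + ((c - 0xD800) << 10) + (uint32_t)(s[i++] - 0xDC00);
        }
        out[n++] = c;
    }
    return n;
}

/* Weighted edit distance in codepoints; returns -1 if allocation fails. */
long u16_levenshtein(const uint16_t *a, size_t alen,
                     const uint16_t *b, size_t blen,
                     long ins, long rep, long del)
{
    uint32_t *ca, *cb;
    long *prev, *cur, *tmp, result = -1;
    size_t na, nb, i, j;

    assert(a != NULL && b != NULL);
    ca = malloc((alen + 1) * sizeof *ca);
    cb = malloc((blen + 1) * sizeof *cb);
    prev = malloc((blen + 1) * sizeof *prev);
    cur = malloc((blen + 1) * sizeof *cur);
    if (ca && cb && prev && cur) {
        na = lev_decode(a, alen, ca);
        nb = lev_decode(b, blen, cb);
        for (j = 0; j <= nb; j++) {
            prev[j] = (long)j * ins;       /* build b from an empty prefix */
        }
        for (i = 0; i < na; i++) {
            cur[0] = prev[0] + del;
            for (j = 0; j < nb; j++) {
                long best = prev[j] + (ca[i] == cb[j] ? 0 : rep);
                long c = prev[j + 1] + del;
                if (c < best) best = c;
                c = cur[j] + ins;
                if (c < best) best = c;
                cur[j + 1] = best;
            }
            tmp = prev; prev = cur; cur = tmp;   /* reuse the two rows */
        }
        result = prev[nb];
    }
    free(ca); free(cb); free(prev); free(cur);
    return result;
}
```

With unit costs, U+1D400 vs U+1F600 (both surrogate halves differ)
comes out as distance 1 here, whereas a UChar-by-UChar comparison would
count 2.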

   =================================================================

The following functions generally work on ASCII input, and should be
made Unicode-aware. However, should they be converted to process
Unicode input?

[1] string addslashes(string text)
[2] string stripslashes(string text)
Escape single/double quotes & backslashes with backslashes

[3] string addcslashes(string text, string charlist)
[4] string stripcslashes(string text)
Escape chars < 32 or > 126 with octal sequences, and escape characters
from charlist with a backslash.

[5] string strip_tags(string text[, string allowed_tags])
Strip HTML/PHP tags from text

