Comments / impl suggestions please.

TIA,
Rolland

--

[1] string substr_replace(string original, string new, int start[, int length])

    Returns a string where original[start..length] is replaced with new.
    Input args can be arrays, in which case the operation is:
        substr_replace(original[i], new[i], start[i], length[i])

    Impl: The current impl is written in terms of memcpy(), after
    adjusting start & length correctly. With Unicode input, 'start' &
    'length' may not be aligned with codepoint/grapheme boundaries. If
    args are of mixed string types, convert to a common type.

[2] int substr_count(string text, string token[, int start[, int length]])

    Returns the number of occurrences of token in text[start..length].

    Impl: The current impl is built around php_memnstr() and can be
    extended for Unicode with zend_u_memnstr().

[3] string strtok([string text, ]string separator)

    Tokenize string.

    Impl: The current impl uses global state, in the form of char ptrs
    and a 256-char array. Mixed string type input would be converted to
    a common type, and the new global state would have to include the
    initial type of the separator. Tokenizing should honor
    base+combining sequences.

[4] string strrev(string text)

    Returns the reversed string equivalent of the input.

    Impl: The current impl walks the input string in reverse and copies
    it one character at a time. For Unicode, this can be achieved using
    the U16_NEXT/U16_PREV macros. Combining characters can be copied
    together using the u_getCombiningClass() API.

[5] string str_pad(string text, int length[, string pad[, int pad_type]])

    Returns the input string padded on the left and/or right (determined
    by pad_type) to the specified length with the pad string.

    Impl: The impl builds the output string by copying the appropriate
    pad characters to the left and/or right of the input string.

    Q: With STR_PAD_BOTH, let's say 'length' == input 'text' length + 2
    (lengths in UChars), but the 'pad' text is non-BMP (i.e. 2 UChars);
    then the 'pad' text can't be added at either end. More generally,
    the 'pad' text can't be split in the middle of non-BMP codepoints or
    base+combining sequences.
    If such a condition occurs, an error should be returned. Any other
    thoughts?

[6] int similar_text(string str1, string str2[, int percentage])

    Returns the number of common characters between str1 & str2.

    Impl: The current impl determines common characters by comparing
    characters to generate common sequences. Comparisons for Unicode
    strings should be done with codepoints.

[7] int levenshtein(string str1, string str2[, int ins_cost, int rep_cost, int del_cost])

    Calculate the Levenshtein distance between str1 & str2.

    Q: Any gotchas in extending the Levenshtein algo for Unicode? Should
    the ins/del/subst cost be expressed in graphemes or codepoints?

=================================================================

The following functions generally work on ASCII input, and should be
made Unicode-aware. However, should they be converted to process Unicode
input?

[1] string addslashes(string text)
[2] string stripslashes(string text)

    Escape single/double quotes & backslashes with backslashes.

[3] string addcslashes(string text, string charlist)
[4] string stripcslashes(string text)

    Escape chars < 32 or > 126 with octal sequences, and escape
    characters from charlist with a backslash.

[5] string strip_tags(string text[, string allowed_tags])

    Strip HTML/PHP tags from text.
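
To make the strrev() idea (item [4] in the first list) a bit more concrete, here is a rough sketch of the surrogate-pair part only. It is not the PHP implementation and does not use ICU: utf16_reverse(), is_lead() and is_trail() are hypothetical names made up for illustration, and plain uint16_t stands in for UChar. Base+combining sequences are deliberately not handled; that part would presumably still need something like u_getCombiningClass().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helpers (not PHP or ICU API): classify UTF-16 surrogates. */
static int is_lead(uint16_t u)  { return (u & 0xFC00) == 0xD800; }
static int is_trail(uint16_t u) { return (u & 0xFC00) == 0xDC00; }

/*
 * Copy 'len' UTF-16 code units from src into dst in reverse code point
 * order. src and dst must not overlap, and dst must hold 'len' units.
 * A well-formed surrogate pair keeps its lead/trail order; an unpaired
 * surrogate is copied as a single unit.
 * Assumption: combining sequences are NOT kept together here.
 */
static void utf16_reverse(const uint16_t *src, size_t len, uint16_t *dst)
{
    size_t in = len;  /* read position, walks backwards */
    size_t out = 0;   /* write position, walks forwards */

    assert(len == 0 || (src != NULL && dst != NULL));

    while (in > 0) {
        if (in >= 2 && is_trail(src[in - 1]) && is_lead(src[in - 2])) {
            /* Non-BMP code point: copy both units in original order. */
            dst[out++] = src[in - 2];
            dst[out++] = src[in - 1];
            in -= 2;
        } else {
            /* BMP code point (or a stray surrogate): copy one unit. */
            dst[out++] = src[in - 1];
            in -= 1;
        }
    }
}
```

The same backwards walk is what U16_PREV would give you in ICU; extending the "copy as one group" branch to cover a base character plus its trailing combining marks is the open part of the question.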