Hello,

On 06/30/2014 06:23 AM, assafgor...@gmail.com wrote:
I'd like to suggest a patch to allow seq to generate letter sequences.
Attached is an improved implementation for the same functionality: ( http://lists.gnu.org/archive/html/coreutils/2014-06/msg00090.html )
With this patch, 'seq' can print letters of alphabets in the current locale (or a user-specified language).

Examples:

    # print all letters in the current alphabet
    seq --alphabet
    seq -a

    # print the first 10 letters in the current alphabet
    seq -a 10

    # print the letters of the Russian alphabet
    # (assuming the locale is installed)
    LC_ALL=ru_RU.utf-8 seq -a

    # print the letters of the Hebrew alphabet
    # (assuming the current locale supports UTF-8 or
    # other encoding supported by gnulib/libunistring)
    seq --alphabet=he
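The Hebrew example depends on each letter's code point being turned into bytes that the terminal encoding understands. As a rough illustration of the UTF-8 half of that step only, here is a minimal POSIX sh sketch. It is not code from the patch (which does this in C via gnulib/libunistring); the function names `put_byte` and `ucs_to_utf8` are made up for this example, and it does not reject surrogates or out-of-range values.

```shell
#!/bin/sh
# Illustration only: hypothetical helpers, not part of the patch.

# Emit a single byte given its decimal value, using a POSIX octal escape.
put_byte() {
    printf "\\$(printf '%03o' "$1")"
}

# Encode one Unicode code point (decimal, or $((0x...))) as UTF-8 bytes.
# Assumption: the caller passes a valid scalar value; no validation is done.
ucs_to_utf8() {
    cp=$1
    if [ "$cp" -lt 128 ]; then
        # 1 byte: 0xxxxxxx
        put_byte "$cp"
    elif [ "$cp" -lt 2048 ]; then
        # 2 bytes: 110xxxxx 10xxxxxx
        put_byte $(( 192 | ($cp >> 6) ))
        put_byte $(( 128 | ($cp & 63) ))
    elif [ "$cp" -lt 65536 ]; then
        # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        put_byte $(( 224 | ($cp >> 12) ))
        put_byte $(( 128 | (($cp >> 6) & 63) ))
        put_byte $(( 128 | ($cp & 63) ))
    else
        # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        put_byte $(( 240 | ($cp >> 18) ))
        put_byte $(( 128 | (($cp >> 12) & 63) ))
        put_byte $(( 128 | (($cp >> 6) & 63) ))
        put_byte $(( 128 | ($cp & 63) ))
    fi
}

# Show the bytes for U+05D0 (Hebrew letter alef) in hex: d7 90
ucs_to_utf8 $((0x5d0)) | od -An -tx1
```

On a UTF-8 terminal, calling `ucs_to_utf8` for each value in a code point range would print the letters directly; for any other encoding a further conversion step is needed, which this sketch does not attempt.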
The new data takes ~5100 bytes (instead of the previous >15KB). It requires a one-time encoding of a textual 'database' file (included) using a perl script (included). Conceptually similar to the unicode tables, this only needs to be done when an alphabet is updated.

The alphabets are encoded in 'src/alphabets_data.h'.
The decoder is in 'src/alphabets.{c,h}'.
The added functionality is in a few new functions in 'src/seq.c'.

===

If you think that this is an acceptable feature (at least conceptually), then I'd be happy to discuss further details, such as which languages to include, and implementation suggestions (for example, should this be moved to gnulib?).

Are there any important encoding issues I might have missed? (The code tries to be as portable as possible: it stores UCS values internally, converts them to UTF-8 with 'u8_uctomb()', then prints them via 'u8_strconv_to_locale()', so there is no assumption about the active encoding.)

Should there be an interface for multi-letter output (e.g. "aa" after "z")?

===

Regarding Bernhard's comment:

On 07/03/2014 02:18 AM, Bernhard Voelker wrote:
The user could let the shell produce the input:

    $ printf "%c" {a..z} | seq -s ' ' --alpha=- 2 2 6
    b d f

thus picking the Nth character from the input. ;-)
I don't think this example is portable, as "{a..z}" is not in POSIX sh, so it can't be relied on in scripting.

However, more generally, it's easy to generate ranges of Unicode symbols if their values are known:

    # Arabic letters (unicode block 0x627 - 0x64a)
    seq $((0x627)) $((0x64a)) | xargs env printf '\\\\u%04x\\\\n' | xargs env printf

    # Cyrillic letters (unicode block 0x410 - 0x42f)
    seq $((0x410)) $((0x42f)) | xargs env printf '\\\\u%04x\\\\n' | xargs env printf

But the problem is that the official alphabet letters of each language are very irregular. For example, a few letters in the Arabic block aren't official ordinal letters (they are valid alphabet symbols for a letter under certain conditions). Also, in some languages, a letter is actually two Unicode symbols (e.g. in Czech, "Ch" is a single letter, in addition to the "C" and "H" letters). In non-English Latin-based languages, besides the simple ASCII letters A-Z, there are additional symbols whose Unicode values are not sequential.

Whether this feature is desired or not in coreutils is one question. But if it is (for more languages than English), then I think simple "ranges" will not suffice.

Comments are welcome,
 -gordon
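To make the point about irregular alphabets concrete, here is a minimal POSIX sh sketch that takes the first N letters from an explicit letter list instead of a code point range. The list is a hand-typed excerpt of the Czech alphabet for illustration only, not data from the patch, and the names `czech_excerpt` and `first_letters` are hypothetical.

```shell
#!/bin/sh
# Illustration only: a hand-typed excerpt, not the patch's alphabet data.
# Note that "ch" is one entry, and that the accented letters are
# interleaved with ASCII letters, so no single code point range fits.
czech_excerpt='a á b c č d ď e é ě f g h ch i'

# Print the first N entries of a letter list, one per line.
# Usage: first_letters N letter...
first_letters() {
    fl_n=$1
    shift
    fl_i=0
    for fl_letter in "$@"; do
        [ "$fl_i" -lt "$fl_n" ] || break
        printf '%s\n' "$fl_letter"
        fl_i=$((fl_i + 1))
    done
}

# The 14th letter in this ordering is the two-character letter "ch".
# The list is deliberately left unquoted so the shell splits it on spaces.
first_letters 14 $czech_excerpt | tail -n 1
```

A range-based approach has no way to express the "ch" entry or the interleaved accented letters, which is why a per-language table seems necessary.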
seq_alphabet.2014-07-08.patch.xz
Description: application/xz