Re: RFC 283 (v1) C<tr///> in array context should return a histogram

Bennett Todd Tue, 26 Sep 2000 17:34:04 -0700
2000-09-26-20:29:22 Paris Sinclair:
> On Tue, 26 Sep 2000, Bennett Todd wrote:
> >     $hist[ord($_)]++ for split //, $string;
> 
> But would this technique work with Unicode?

Beats me; I've never tried programming against Unicode. I don't
speak any language other than English, so I don't expect I will in
the future either. I expect the answer to your question depends
partly on details of the encoding, and partly on the implementations
of split and ord in a Unicode-infested world.

Could be someone would try and feed some kanji through it or
something and produce a sparse array a trillion bytes long, for all
I know. If you're worried about a scary sparse alphabet, switch the
[] to {} and use a hash:-).
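To put rough numbers on the sparse-array worry, here is a minimal
sketch. The sample string (three ASCII letters plus two kanji) is made
up for illustration, and it assumes a Perl where ord returns the full
code point of a wide character.

```perl
use strict;
use warnings;

# Hypothetical sample: "aab" followed by two kanji (U+65E5, U+672C).
my $string = "aab\x{65E5}\x{672C}";

my (@hist, %hist);
for my $ch (split //, $string) {
    $hist[ord($ch)]++;   # array: grows to the largest code point seen, plus one
    $hist{ord($ch)}++;   # hash: one entry per distinct character seen
}

printf "array slots: %d, hash keys: %d\n", scalar(@hist), scalar(keys %hist);
```

The array ends up with tens of thousands of mostly empty slots, while
the hash holds only the four characters that actually occurred.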

> What if I am just counting some Bulgarian characters? Most
> encodings put these in the extended ascii range. Making an array
> of 250 items for a count of 5 items isn't going to be more
> efficient.

I'd expect it would; an array of 250 items is teensy.

> Also, it requires jumping through more hoops, and doing
> more conversions, to figure out which index is which letter.

Yup, I'm a sick little monkey who truly doesn't care about anything
other than US-ASCII, and doesn't mind the mildly extended encodings
like ISO 8859-1 because they include ASCII as their 7-bit subset; if
I get a text file and it's not in ASCII I can't read it anyway, so I
toss it.

> A table could be built, but if it maps to an array index, based on
> ord(), then I couldn't support both KOI-8 and windows cyrillic
> encodings in the same @hist structure.

If you're gonna have both KOI-8 and windows cyrillic encodings in
the same single string being passed to split, I am really really
glad I don't share your problems. I'll stand way, way back, thanks.

If you're getting them from different sources, you could map them as
you consolidate them. But I think most folks would go for a single
common encoding before they even began examining the contents.
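As a rough sketch of "map them as you consolidate them": the following
assumes a Perl with the core Encode module (5.8 or later, so not
something available when this thread was written). The @sources list
and its byte strings are hypothetical; both spell the same two-letter
Cyrillic word, once in KOI8-R and once in cp1251.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical inputs: the same two letters as raw bytes in two encodings.
my @sources = (
    [ 'koi8-r', "\xC4\xC1" ],
    [ 'cp1251', "\xE4\xE0" ],
);

my %hist;
for my $src (@sources) {
    my ($encoding, $bytes) = @$src;
    my $text = decode($encoding, $bytes);   # raw bytes -> Perl characters
    $hist{$_}++ for split //, $text;        # key on the character itself
}

# Report by code point so the output stays plain ASCII.
printf "U+%04X: %d\n", ord($_), $hist{$_} for sort keys %hist;
```

Once everything is decoded to one common representation, the histogram
no longer needs to know which encoding each character arrived in.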

> Using a hash, the only limits are the more general language
> supports in Perl, and I can still convert and store KOI8 and
> cp1251, and store the results without needing to know which coding
> it originated in; only needing to have a symbol for the character.

If the purpose for including histogram-generation as a builtin to
perl, as a context-triggered side-effect of tr///, is to support
i18n, let's do please make that very explicit in the RFC. If we
don't, the requirements to make that work might not get thought
completely through and the desired i18n might not actually work. Oh,
and if the implementation is going to have to do all the right
brilliant stuff for i18n in the face of every conceivable encoding,
I expect it's not gonna be faster than the hash-based equivalent
construct:

        $hist{ord($_)}++ for split //, $string;

which only requires that split// and ord do something appropriately
consistent across encodings.
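For what it's worth, getting from a numeric key back to its letter is
a single chr call. A small sketch, with a made-up sample string:

```perl
use strict;
use warnings;

my $string = "hello world";   # hypothetical sample input

my %hist;
$hist{ord($_)}++ for split //, $string;

# Walk the keys in numeric order and turn each one back into its character.
for my $code (sort { $a <=> $b } keys %hist) {
    printf "%3d '%s' %d\n", $code, chr($code), $hist{$code};
}
```

This relies on the same assumption as the one-liner: ord and chr have
to round-trip consistently for whatever characters split hands back.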

But when people claim i18n benefits for things I tend to just go
away to my corner and get quiet; since I don't plan on writing
multilingual code or working with multilingual data, I don't feel
qualified to hold an opinion.

-Bennett