> Is the output of file(1) appropriate for this purpose?
> Shouldn't your sample file also be sent as UTF-8?

it should be.  for example since
        ; echo ☺ | file
        stdin: short UTF text   # sic
one would expect that echo ☺ | file -m
would yield text/plain; charset=utf-8.

> file(1) speaks only mime type but not charset.

file does sometimes return a character set.

minooka;  grep -n charset /sys/src/cmd/file.c | sed 1q
594:    0xfeff0000,     0xffffffff,     "utf-32be\n",
        "text/plain charset=utf-32be",

it doesn't make sense to me for file to be
inconsistent.  if file emits character sets, it
should always emit character sets.

i'm not sure why the ';' is dropped.  this would force
a client to parse the output.
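
to make the cost concrete, here is roughly what a client
ends up writing if it has to accept both spellings.  this
is only a sketch in python, assuming the input is the bare
mime string with no "stdin: " style prefix.  parse_mime is
a made-up name, not part of file(1) or any library.

```python
def parse_mime(line):
    """Split a mime string into (mime type, charset or None).

    Hypothetical helper: accepts both "text/plain; charset=utf-8"
    and the ';'-less form "text/plain charset=utf-32be".
    """
    # treat ';' as whitespace so both spellings tokenize the same way
    fields = line.replace(";", " ").split()
    if not fields:
        return None, None
    mimetype, charset = fields[0], None
    for field in fields[1:]:
        key, sep, value = field.partition("=")
        if sep and key.lower() == "charset":
            charset = value.strip('"').lower()
    return mimetype, charset
```

if file always emitted "type; charset=x", the output could
be handed on as a content-type value without any of this.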

> it is difficult or impossible to determine charset from a few japanese  
> letters.

plan 9 is a utf-8 system.  if we have files in another
character set that's not a proper subset, most plan 9
tools will not work properly on them.
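
as a rough illustration (python, not plan 9 code, and
is_utf8 is a name made up for this sketch): bytes in a
legacy japanese encoding generally fail a strict utf-8
decode, while ascii, being a proper subset, passes.

```python
def is_utf8(data):
    """Report whether a byte string decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# sample inputs, chosen only for illustration
utf8_bytes = "☺\n".encode("utf-8")
ascii_bytes = b"plain ascii\n"
sjis_bytes = "日本語".encode("shift_jis")
```

passing this check does not prove a file was written as
utf-8.  it only says the bytes are well formed.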

also, since it is hard to guess the charset of particular
japanese-encoded files, it would probably be good to
force their encoding with html decoration.

- erik
