I wrote:
> [...] You can sniff the first few bytes (which is what is recommended
> in the XML 1.0 spec, you can see how they do it there), but making
> such an assumption may lead to program failure if the assumption
> is incorrect.

   Extensible Markup Language (XML) 1.0 (Third Edition)
   Appendix F Autodetection of Character Encodings

The suggestions there are pretty usable even for files that have
nothing to do with XML.
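
As a rough sketch of the byte-order-mark part of that appendix, something like the following could work. The class and method names (BomSniffer, sniff) are made up for illustration, and it only looks at byte order marks, not at the "<?xml" patterns the appendix also covers.

```java
final class BomSniffer {
    /**
     * Guesses an encoding name from a leading byte order mark.
     * Returns null when no byte order mark is recognised.
     */
    static String sniff(byte[] b) {
        int n = b.length;
        // Check the 4-byte marks first: FF FE 00 00 also begins with the
        // UTF-16LE mark FF FE, so the order of these tests matters.
        // (Assumption: a UTF-16LE file whose first character is NUL is
        // unlikely enough to ignore here.)
        if (n >= 4 && u(b[0]) == 0x00 && u(b[1]) == 0x00
                && u(b[2]) == 0xFE && u(b[3]) == 0xFF) return "UTF-32BE";
        if (n >= 4 && u(b[0]) == 0xFF && u(b[1]) == 0xFE
                && u(b[2]) == 0x00 && u(b[3]) == 0x00) return "UTF-32LE";
        if (n >= 3 && u(b[0]) == 0xEF && u(b[1]) == 0xBB
                && u(b[2]) == 0xBF) return "UTF-8";
        if (n >= 2 && u(b[0]) == 0xFE && u(b[1]) == 0xFF) return "UTF-16BE";
        if (n >= 2 && u(b[0]) == 0xFF && u(b[1]) == 0xFE) return "UTF-16LE";
        return null;
    }

    // Java bytes are signed; widen to an unsigned int for comparison.
    private static int u(byte x) {
        return x & 0xFF;
    }
}
```

A null result only means "no byte order mark found", so the caller still has to fall back on some default or on further sniffing.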

I neglected to mention that the XML method relies on the file starting
with "<?xml". In the case of source files for the Lucene project, the
beginning of a file is likely to be one of four things:

   "..."        (some form of whitespace)
   "/*"         (the beginning of an Apache License)
   "<html"      (beginning of an HTML file)
   "<!DOCTYPE"  (beginning of an HTML file)

It wouldn't be too hard to write a sniffer for this. I think almost
all of the Lucene source starts with "package", and where it doesn't,
it certainly could.
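
A minimal sketch of such a sniffer, assuming the first bytes of the file have already been decoded as an ASCII-compatible string (for instance with ISO-8859-1). The names PrefixSniffer, Kind and classify are hypothetical, not anything that exists in Lucene.

```java
final class PrefixSniffer {
    enum Kind { COMMENT, HTML, DOCTYPE, PACKAGE, UNKNOWN }

    /** Classifies a file by the first non-whitespace characters of its head. */
    static Kind classify(String head) {
        // Skip any leading whitespace before looking at the prefix.
        int i = 0;
        while (i < head.length() && Character.isWhitespace(head.charAt(i))) {
            i++;
        }
        if (head.startsWith("/*", i)) return Kind.COMMENT;
        // Match the HTML prefixes case-insensitively, so that a lowercase
        // "<!doctype" is still recognised.
        if (head.regionMatches(true, i, "<!doctype", 0, 9)) return Kind.DOCTYPE;
        if (head.regionMatches(true, i, "<html", 0, 5)) return Kind.HTML;
        if (head.startsWith("package", i)) return Kind.PACKAGE;
        return Kind.UNKNOWN;
    }
}
```

Calling it on the first hundred or so characters of a file should be enough, since only the prefix is inspected. An UNKNOWN result is the case where guessing would be risky.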

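In grepping through the source I noted nine instances of a lowercase
"<!doctype". As far as I know the keyword itself is case-insensitive in
HTML 4 (it is XML that insists on uppercase), but the public identifier
is case-sensitive, so the all-lowercase declaration still isn't valid.
This should probably be filed as a bug. Kinda makes me wonder what's
generating that, because when I run javadoc on my own stuff this
doesn't happen.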

org/apache/lucene/util/package.html:<!doctype html public "-//w3c//dtd html 4.0 
org/apache/lucene/index/package.html:<!doctype html public "-//w3c//dtd html 4.0 
org/apache/lucene/store/package.html:<!doctype html public "-//w3c//dtd html 4.0 
org/apache/lucene/queryParser/package.html:<!doctype html public "-//w3c//dtd html 4.0 
org/apache/lucene/search/spans/package.html:<!doctype html public "-//w3c//dtd html 4.0 
org/apache/lucene/search/package.html:<!doctype html public "-//w3c//dtd html 4.0 
org/apache/lucene/document/package.html:<!doctype html public "-//w3c//dtd html 4.0 
org/apache/lucene/analysis/standard/package.html:<!doctype html public "-//w3c//dtd html 4.0 transitional//en">
org/apache/lucene/analysis/package.html:<!doctype html public "-//w3c//dtd html 4.0 
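
For anyone who wants to repeat the check without grep, here is a small sketch that flags a line containing a doctype keyword in anything other than uppercase. DoctypeCheck and hasNonUppercaseDoctype are names I made up for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class DoctypeCheck {
    // Case-insensitive so that every spelling of the keyword is found.
    private static final Pattern DOCTYPE =
            Pattern.compile("<!doctype", Pattern.CASE_INSENSITIVE);

    /** True if the line has a doctype keyword that is not exactly "<!DOCTYPE". */
    static boolean hasNonUppercaseDoctype(String line) {
        Matcher m = DOCTYPE.matcher(line);
        while (m.find()) {
            if (!m.group().equals("<!DOCTYPE")) {
                return true;
            }
        }
        return false;
    }
}
```

It only looks at the keyword, so it says nothing about whether the rest of the declaration is correct.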


