I wrote:
> [...] You can sniff the first few bytes (which is what is recommended
> in the XML 1.0 spec, you can see how they do it there), but making
> such an assumption may lead to program failure if the assumption
> is incorrect.

   Extensible Markup Language (XML) 1.0 (Third Edition)
   Appendix F Autodetection of Character Encodings
   http://www.w3.org/TR/2004/REC-xml-20040204/#sec-guessing

The suggestions there are pretty usable for files that have nothing
to do with XML.
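
As a rough illustration of the part of Appendix F that carries over to
non-XML files, here is a minimal sketch of byte order mark (BOM)
detection. The function name sniff_bom and the returned labels (Python
codec names) are my own choices for the example, not anything from the
spec or from Lucene, and the unusual UCS-4 byte orders listed in the
appendix are left out.

```python
# Byte order marks, longest first so a UTF-32 BOM is not mistaken
# for the UTF-16 BOM that shares its first two bytes.
BOMS = [
    (b"\x00\x00\xfe\xff", "utf-32-be"),
    (b"\xff\xfe\x00\x00", "utf-32-le"),
    (b"\xef\xbb\xbf", "utf-8"),
    (b"\xfe\xff", "utf-16-be"),
    (b"\xff\xfe", "utf-16-le"),
]


def sniff_bom(head: bytes):
    """Return the encoding implied by a leading BOM, or None if there is none."""
    for bom, name in BOMS:
        if head.startswith(bom):
            return name
    return None
```

A file with no BOM returns None here, so the caller still has to fall
back on something else (a default, or a guess from the content).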

I neglected to mention that the XML method relies on the file
beginning with "<?xml". In the case of source files for the Lucene
project, the beginning of each file is likely one of the following:

   "package..."
   "..."     (some form of whitespace)
   "/*"      (the beginning of an Apache License)
   "<html"   (beginning of HTML file)
   "<!DOCTYPE" (beginning of HTML file)

It wouldn't be too hard to write a sniffer for this. I think almost
all of the Lucene source starts with "package", and any file that
doesn't certainly could.
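
A minimal sketch of such a sniffer, for files that have no byte order
mark: it checks whether the first bytes, after leading whitespace,
spell one of the expected starts in some candidate encoding. The names
sniff_by_prefix, KNOWN_STARTS and WIDE_ENCODINGS, and the
"ascii-compatible" label, are made up for the example. Note the limit
of the approach: the expected starts are all ASCII, so this can tell
apart code unit width and byte order, but it cannot tell UTF-8 from
ISO-8859-1 or any other ASCII-compatible encoding.

```python
# Expected starts, compared case-insensitively because a lowercase
# "<!doctype" also occurs in the source tree.
KNOWN_STARTS = ("package", "/*", "<html", "<!doctype")
WIDE_ENCODINGS = ("utf-16-be", "utf-16-le", "utf-32-be", "utf-32-le")


def sniff_by_prefix(head: bytes):
    """Guess the encoding family from the first bytes of a file, or None."""
    # In any ASCII-compatible encoding the expected starts appear byte for byte.
    ascii_starts = tuple(s.encode("ascii") for s in KNOWN_STARTS)
    if head.lstrip().lower().startswith(ascii_starts):
        return "ascii-compatible"

    # Trim to a multiple of 4 bytes so a partial code unit at the end
    # of the sample does not cause a spurious decoding error.
    head = head[: len(head) - len(head) % 4]
    for encoding in WIDE_ENCODINGS:
        try:
            text = head.decode(encoding)
        except UnicodeDecodeError:
            continue  # the bytes are not valid in this encoding
        if text.lstrip().lower().startswith(KNOWN_STARTS):
            return encoding
    return None
```

Reading the first 64 bytes or so of each file should be plenty for
this, unless a file starts with a long run of blank lines.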

In grepping through the source I noted nine instances of a lowercase
use of "<!doctype", which isn't valid. This should probably be registered
as a bug. Kinda makes me wonder what's generating that, because when
I run javadoc on my own stuff this doesn't happen.

org/apache/lucene/util/package.html:<!doctype html public "-//w3c//dtd html 4.0 transitional//en">
org/apache/lucene/index/package.html:<!doctype html public "-//w3c//dtd html 4.0 transitional//en">
org/apache/lucene/store/package.html:<!doctype html public "-//w3c//dtd html 4.0 transitional//en">
org/apache/lucene/queryParser/package.html:<!doctype html public "-//w3c//dtd html 4.0 transitional//en">
org/apache/lucene/search/spans/package.html:<!doctype html public "-//w3c//dtd html 4.0 transitional//en">
org/apache/lucene/search/package.html:<!doctype html public "-//w3c//dtd html 4.0 transitional//en">
org/apache/lucene/document/package.html:<!doctype html public "-//w3c//dtd html 4.0 transitional//en">
org/apache/lucene/analysis/standard/package.html:<!doctype html public "-//w3c//dtd html 4.0 transitional//en">
org/apache/lucene/analysis/package.html:<!doctype html public "-//w3c//dtd html 4.0 transitional//en">

Murray

......................................................................
Murray Altheim                    http://kmi.open.ac.uk/people/murray/
Knowledge Media Institute
The Open University, Milton Keynes, Bucks, MK7 6AA, UK               .

