As announced, I did some further work on the removal of InsetLaTeXAccent.
The lyx2lyx part is now finished and working, and I also got a replacement
for the LaTeX output up and running (but it is very slow - don't try to
export the user guide, it will take many minutes).

It works like this:

lib/unicodesymbols contains a list of ucs4 code points and LaTeX commands.
This list is used to output characters that cannot be encoded in the
current encoding, and it replaces the hardcoded stuff for euro & co in
Paragraph::Pimpl::simpleTeXSpecialChars. It is also used for the characters
that were formerly handled by InsetLaTeXAccent, although I only manually
added the code points that the attached test file uses. Later we can
complete the list from similar existing databases, such as the one in
plastex. The nice thing about this list is that we no longer need to care
which encoding contains which characters - we use iconv to get that
information (see below).
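
To make the lookup order concrete, here is a minimal sketch in plain Python
(not the actual C++; the function name and the two table entries are purely
illustrative) of what Encoding::latexChar in the patch below does:

```python
# Illustrative sketch, not LyX code: the names and entries are made up,
# but the lookup order mirrors Encoding::latexChar from the patch.
unicodesymbols = {
    0x20ac: "\\texteuro{}",  # EURO SIGN
    0x0131: "\\i{}",         # LATIN SMALL LETTER DOTLESS I
}

def latex_char(c, representable):
    """Return what would be written to the .tex file for code point c."""
    if c in representable:     # the current encoding can hold the raw char
        return chr(c)
    if c in unicodesymbols:    # otherwise fall back to the LaTeX macro
        return unicodesymbols[c]
    # no macro known: warn (omitted here) and emit the character anyway
    return chr(c)
```

So a character is only ever replaced by a macro when the target encoding
cannot represent it directly.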

José had a different idea some time ago: use a postprocessor on the
generated .tex file to convert it, e.g. from utf8 to the proper LaTeX
commands. I don't think that this is feasible, since such a postprocessor
would need to parse the TeX code (e.g. scanning for \selectlanguage or
\inputencoding), and that can fail too easily. Therefore I think that we
should do this in LyX, where we have all the needed information (but of
course use an external character database, so that users can extend it if
needed).

The Encoding class queries iconv on startup for a translation table from
ucs4 to all 256 single-byte code points (therefore you'll get a lot of
iconv error messages on startup - ignore them for now). This is done so
that the Encoding class can tell whether a given ucs4 character can be
encoded in this encoding or not.
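
The same probing can be sketched in Python, with the codecs machinery
standing in for iconv (the function name is hypothetical): decode each of
the 256 byte values and invert the result.

```python
# Sketch of building a UCS4 -> single-byte translation table by probing
# the codec, analogous to what the Encoding constructor does with iconv.
# Bytes that are undefined in the encoding raise an error -- those
# correspond to the iconv error messages mentioned above.
def build_table(encoding_name):
    table = {}
    for byte in range(256):
        try:
            ch = bytes([byte]).decode(encoding_name)
        except UnicodeDecodeError:
            continue  # this byte has no meaning in the encoding
        if len(ch) == 1:
            table[ord(ch)] = byte
    return table
```

With such a table, a simple membership test answers "can this ucs4
character be encoded here?": e.g. U+20AC is in the latin9 table (at 0xa4)
but not in the latin1 one.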

In order to properly output accents on spaces I had to remove an (IMO ugly)
hack that tries to do clever things with spaces and font changes (see
http://bugzilla.lyx.org/show_bug.cgi?id=1428). I am not sure whether this
counts as a file format change, but IMO we should not try to be clever here
and should simply output what the user entered. If I don't want
\textbf{xxx }, then I should not mark the space as boldface.
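
For illustration, here is the difference in the generated LaTeX (a minimal
made-up fragment):

```latex
% old hack: a space typed inside a bold run was moved outside the group
\textbf{xxx} yyy
% without the hack: the output matches exactly what was entered
\textbf{xxx }yyy
```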

Comments?
If this is OK in general I'll do some profiling, resolve the speed issues
and put this in.


Georg

Attachment: latexaccent-all-257.lyx
Description: application/lyx

Attachment: latexaccent-all.lyx
Description: application/lyx

Index: src/paragraph.h
===================================================================
--- src/paragraph.h	(Revision 16797)
+++ src/paragraph.h	(Arbeitskopie)
@@ -112,7 +112,7 @@ public:
 	void write(Buffer const &, std::ostream &, BufferParams const &,
 		   depth_type & depth) const;
 	///
-	void validate(LaTeXFeatures &) const;
+	void validate(LaTeXFeatures &, LyXFont const &) const;
 
 	///
 	int startTeXParParams(BufferParams const &, odocstream &, bool) const;
Index: src/insets/insettext.C
===================================================================
--- src/insets/insettext.C	(Revision 16797)
+++ src/insets/insettext.C	(Arbeitskopie)
@@ -338,8 +338,13 @@ int InsetText::docbook(Buffer const & bu
 
 void InsetText::validate(LaTeXFeatures & features) const
 {
-	for_each(paragraphs().begin(), paragraphs().end(),
-		 bind(&Paragraph::validate, _1, ref(features)));
+	ParagraphList::const_iterator const begin = paragraphs().begin();
+	ParagraphList::const_iterator const end = paragraphs().end();
+	for (ParagraphList::const_iterator it = begin; it != end; ++it) {
+		LyXFont const outerfont =
+			outerFont(std::distance(begin, it), paragraphs());
+		it->validate(features, outerfont);
+	}
 }
 
 
Index: src/insets/Makefile.am
===================================================================
--- src/insets/Makefile.am	(Revision 16797)
+++ src/insets/Makefile.am	(Arbeitskopie)
@@ -77,8 +77,6 @@ libinsets_la_SOURCES = \
 	insetindex.h \
 	insetlabel.C \
 	insetlabel.h \
-	insetlatexaccent.C \
-	insetlatexaccent.h \
 	insetline.C \
 	insetline.h \
 	insetmarginal.h \
Index: src/encoding.h
===================================================================
--- src/encoding.h	(Revision 16797)
+++ src/encoding.h	(Arbeitskopie)
@@ -14,9 +14,8 @@
 #define ENCODING_H
 
 #include <map>
-#include <string>
 
-#include "support/types.h"
+#include "support/docstring.h"
 
 namespace lyx {
 
@@ -29,16 +28,25 @@ public:
 	Encoding() {}
 	///
 	Encoding(std::string const & n, std::string const & l,
-	         std::string const & i)
-		: Name_(n), LatexName_(l), iconvName_(i)
-	{
-	}
+	         std::string const & i);
 	///
 	std::string const & name() const { return Name_; }
 	///
 	std::string const & latexName() const { return LatexName_; }
 	///
 	std::string const & iconvName() const { return iconvName_; }
+	/**
+	 * Convert \p c to something that LaTeX can understand.
+	 * This is either the character itself (if it is representable
+	 * in this encoding), or a LaTeX macro.
+	 * If the character is not representable in this encoding, but no
+	 * LaTeX macro is known, a warning is printed on lyxerr, and the
+	 * character is returned.
+	 */
+	docstring const latexChar(char_type c) const;
+	/// Return the preamble snippet needed for the output of latexChar
+	/// (if any)
+	std::string const preamble(char_type c) const;
 private:
 	///
 	std::string Name_;
@@ -46,6 +54,11 @@ private:
 	std::string LatexName_;
 	///
 	std::string iconvName_;
+	///
+	typedef std::map<char_type, char> TransTable;
+	/// Translation table from UCS4 to this encoding (for singlebyte
+	/// encodings only)
+	TransTable ucs4_to_self_;
 };
 
 class Encodings {
@@ -64,8 +77,11 @@ public:
 	};
 	///
 	Encodings();
-	///
-	void read(support::FileName const & filename);
+	/// Read the encodings.
+	/// \param encfile encodings definition file
+	/// \param symbolsfile unicode->LaTeX mapping file
+	void read(support::FileName const & encfile,
+	          support::FileName const & symbolsfile);
 	/// Get encoding from LyX name \p name
 	Encoding const * getFromLyXName(std::string const & name) const;
 	/// Get encoding from LaTeX name \p name
@@ -97,6 +113,8 @@ public:
 	static bool is_arabic(char_type c);
 	///
 	static char_type transformChar(char_type c, Letter_Form form);
+	/// Is this a combining char?
+	static bool isCombiningChar(char_type c);
 
 private:
 	///
Index: src/paragraph_pimpl.C
===================================================================
--- src/paragraph_pimpl.C	(Revision 16797)
+++ src/paragraph_pimpl.C	(Arbeitskopie)
@@ -59,16 +59,11 @@ special_phrase const special_phrases[] =
 size_t const phrases_nr = sizeof(special_phrases)/sizeof(special_phrase);
 
 
-bool isEncoding(BufferParams const & bparams, LyXFont const & font,
-		string const & encoding)
+Encoding const & getEncoding(BufferParams const & bparams, LyXFont const & font)
 {
-	// We do ignore bparams.inputenc == "default" here because characters
-	// in this encoding could be treated by TeX as something different,
-	// e.g. if they are inside a CJK environment. See also
-	// http://bugzilla.lyx.org/show_bug.cgi?id=3043.
-	return (bparams.inputenc == encoding
-		|| (bparams.inputenc == "auto"
-		    && font.language()->encoding()->latexName() == encoding));
+	if (bparams.inputenc == "auto" || bparams.inputenc == "default")
+		return *(font.language()->encoding());
+	return bparams.encoding();
 }
 
 } // namespace anon
@@ -381,8 +376,36 @@ int Paragraph::Pimpl::eraseChars(pos_typ
 }
 
 
-void Paragraph::Pimpl::simpleTeXBlanks(odocstream & os, TexRow & texrow,
-				       pos_type const i,
+int Paragraph::Pimpl::latexSurrogatePair(BufferParams const & bparams,
+		odocstream & os, LyXFont const & font, pos_type & i,
+		value_type c)
+{
+	if (i < size() - 1) {
+		char_type next = getChar(i + 1);
+		if (Encodings::isCombiningChar(next)) {
+			// Writing next here may circumvent a possible font
+			// change between c and next. Since next is only
+			// output if it forms a surrogate pair with c we can
+			// ignore this:
+			// A font change inside a surrogate pair does not make
+			// sense and is hopefully impossible to input.
+			// FIXME: Does this work with change tracking?
+			Encoding const & encoding = getEncoding(bparams, font);
+			docstring const latex1 = encoding.latexChar(next);
+			docstring const latex2 = encoding.latexChar(c);
+			os << latex1 << '{' << latex2 << '}';
+			lyxerr << "output combining pair: '" << to_utf8(latex1 + '{' + latex2 + '}') << "'." << endl;
+			++i;
+			return latex1.length() + latex2.length() + 2;
+		}
+	}
+	return 0;
+}
+
+
+void Paragraph::Pimpl::simpleTeXBlanks(BufferParams const & bparams,
+                                       odocstream & os, TexRow & texrow,
+				       pos_type & i,
 				       unsigned int & column,
 				       LyXFont const & font,
 				       LyXLayout const & style)
@@ -390,6 +413,13 @@ void Paragraph::Pimpl::simpleTeXBlanks(o
 	if (style.pass_thru)
 		return;
 
+	int n = latexSurrogatePair(bparams, os, font, i, ' ');
+	if (n > 0) {
+		// This space has an accent, so we must always output it.
+		column += n - 1;
+		return;
+	}
+
 	if (lyxrc.plaintext_linelen > 0
 	    && column > lyxrc.plaintext_linelen
 	    && i
@@ -463,6 +493,8 @@ void Paragraph::Pimpl::simpleTeXSpecialC
 	if (style.pass_thru) {
 		if (c != Paragraph::META_INSET) {
 			if (c != '\0')
+				// FIXME UNICODE: This can fail if c cannot
+				// be encoded in the current encoding.
 				os.put(c);
 		} else
 			owner_->getInset(i)->plaintext(buf, os, runparams);
@@ -579,25 +611,6 @@ void Paragraph::Pimpl::simpleTeXSpecialC
 		// would be wrongly converted on systems where char is signed, so we give
 		// the code points.
 		// This also makes us independant from the encoding of this source file.
-		case 0xb1:    // ± PLUS-MINUS SIGN
-		case 0xb2:    // ² SUPERSCRIPT TWO
-		case 0xb3:    // ³ SUPERSCRIPT THREE
-		case 0xd7:    // × MULTIPLICATION SIGN
-		case 0xf7:    // ÷ DIVISION SIGN
-		case 0xb9:    // ¹ SUPERSCRIPT ONE
-		case 0xac:    // ¬ NOT SIGN
-		case 0xb5:    // µ MICRO SIGN
-			if (isEncoding(bparams, font, "latin1")
-			    || isEncoding(bparams, font, "latin9")) {
-				os << "\\ensuremath{";
-				os.put(c);
-				os << '}';
-				column += 13;
-			} else {
-				os.put(c);
-			}
-			break;
-
 		case '|': case '<': case '>':
 			// In T1 encoding, these characters exist
 			if (lyxrc.fontenc == "T1") {
@@ -656,82 +669,6 @@ void Paragraph::Pimpl::simpleTeXSpecialC
 			column += 9;
 			break;
 
-		case 0xa3:    // £ POUND SIGN
-			if (bparams.inputenc == "default") {
-				os << "\\pounds{}";
-				column += 8;
-			} else {
-				os.put(c);
-			}
-			break;
-
-		case 0x20ac:    // EURO SIGN
-			if (isEncoding(bparams, font, "latin9")
-			    || isEncoding(bparams, font, "cp1251")
-			    || isEncoding(bparams, font, "utf8")
-			    || isEncoding(bparams, font, "latin10")
-			    || isEncoding(bparams, font, "cp858")) {
-				os.put(c);
-			} else {
-				os << "\\texteuro{}";
-				column += 10;
-			}
-			break;
-
-		// These characters are covered by latin1, but not
-		// by latin9 (a.o.). We have to support them because
-		// we switched the default of latin1-languages to latin9
-		case 0xa4:    // CURRENCY SYMBOL
-		case 0xa6:    // BROKEN BAR
-		case 0xa8:    // DIAERESIS
-		case 0xb4:    // ACUTE ACCENT
-		case 0xb8:    // CEDILLA
-		case 0xbd:    // 1/2 FRACTION
-		case 0xbc:    // 1/4 FRACTION
-		case 0xbe:    // 3/4 FRACTION
-			if (isEncoding(bparams, font, "latin1")
-			    || isEncoding(bparams, font, "latin5")
-			    || isEncoding(bparams, font, "utf8")) {
-				os.put(c);
-				break;
-			} else {
-				switch (c) {
-				case 0xa4:
-					os << "\\textcurrency{}";
-					column += 15;
-					break;
-				case 0xa6:
-					os << "\\textbrokenbar{}";
-					column += 16;
-					break;
-				case 0xa8:
-					os << "\\textasciidieresis{}";
-					column += 20;
-					break;
-				case 0xb4:
-					os << "\\textasciiacute{}";
-					column += 17;
-					break;
-				case 0xb8: // from latin1.def:
-					os << "\\c\\ ";
-					column += 3;
-					break;
-				case 0xbd:
-					os << "\\textonehalf{}";
-					column += 14;
-					break;
-				case 0xbc:
-					os << "\\textonequarter{}";
-					column += 17;
-					break;
-				case 0xbe:
-					os << "\\textthreequarters{}";
-					column += 20;
-					break;
-				}
-				break;
-			}
-
 		case '$': case '&':
 		case '%': case '#': case '{':
 		case '}': case '_':
@@ -769,6 +706,8 @@ void Paragraph::Pimpl::simpleTeXSpecialC
 		default:
 
 			// I assume this is hack treating typewriter as verbatim
+			// FIXME UNICODE: This can fail if c cannot be encoded
+			// in the current encoding.
 			if (font.family() == LyXFont::TYPEWRITER_FAMILY) {
 				if (c != '\0') {
 					os.put(c);
@@ -796,7 +735,16 @@ void Paragraph::Pimpl::simpleTeXSpecialC
 			}
 
 			if (pnr == phrases_nr && c != '\0') {
-				os.put(c);
+				int n = latexSurrogatePair(bparams, os, font, i, c);
+				if (n > 0) {
+					column += n - 1;
+					break;
+				}
+				Encoding const & encoding = getEncoding(bparams, font);
+				docstring const latex = encoding.latexChar(c);
+				lyxerr << "output char: '" << to_utf8(latex) << "'." << endl;
+				column += latex.length() - 1;
+				os << latex;
 			}
 			break;
 		}
@@ -805,6 +753,7 @@ void Paragraph::Pimpl::simpleTeXSpecialC
 
 
 void Paragraph::Pimpl::validate(LaTeXFeatures & features,
+                                LyXFont const & outerfont,
 				LyXLayout const & layout) const
 {
 	BufferParams const & bparams = features.bufferParams();
@@ -882,12 +831,11 @@ void Paragraph::Pimpl::validate(LaTeXFea
 				break;
 			}
 		}
-		// these glyphs require the textcomp package
-		if (getChar(i) == 0x20ac || getChar(i) == 0xa4
-		    || getChar(i) == 0xa6 || getChar(i) == 0xa8
-		    || getChar(i) == 0xb4 || getChar(i) == 0xbd
-		    || getChar(i) == 0xbc || getChar(i) == 0xbe)
-			features.require("textcomp");
+		LyXFont const & font = owner_->getFont(bparams, i, outerfont);
+		Encoding const & encoding = getEncoding(bparams, font);
+		string const preamble = encoding.preamble(getChar(i));
+		if (!preamble.empty())
+			features.addPreambleSnippet(preamble);
 	}
 }
 
Index: src/buffer.C
===================================================================
--- src/buffer.C	(Revision 16797)
+++ src/buffer.C	(Arbeitskopie)
@@ -141,7 +141,7 @@ using std::string;
 
 namespace {
 
-int const LYX_FORMAT = 256;
+int const LYX_FORMAT = 257;
 
 } // namespace anon
 
@@ -1192,8 +1192,13 @@ void Buffer::validate(LaTeXFeatures & fe
 	if (params().use_esint == BufferParams::package_on)
 		features.require("esint");
 
-	for_each(paragraphs().begin(), paragraphs().end(),
-		 boost::bind(&Paragraph::validate, _1, boost::ref(features)));
+	ParagraphList::const_iterator const begin = paragraphs().begin();
+	ParagraphList::const_iterator const end = paragraphs().end();
+	for (ParagraphList::const_iterator it = begin; it != end; ++it) {
+		LyXFont const outerfont =
+			outerFont(std::distance(begin, it), paragraphs());
+		it->validate(features, outerfont);
+	}
 
 	// the bullet shapes are buffer level not paragraph level
 	// so they are tested here
Index: src/paragraph_pimpl.h
===================================================================
--- src/paragraph_pimpl.h	(Revision 16797)
+++ src/paragraph_pimpl.h	(Arbeitskopie)
@@ -123,9 +123,14 @@ public:
 	///
 	FontList fontlist;
 
+	/// If \p c forms a surrogate pair with the character at position
+	/// i + 1, output it to \p os and return the number of characters
+	/// written. Otherwise do nothing and return 0.
+	int latexSurrogatePair(BufferParams const &, odocstream & os,
+	                       LyXFont const &, pos_type & i, value_type c);
 	///
-	void simpleTeXBlanks(odocstream &, TexRow & texrow,
-			     pos_type const i,
+	void simpleTeXBlanks(BufferParams const &, odocstream &, TexRow & texrow,
+			     pos_type & i,
 			     unsigned int & column,
 			     LyXFont const & font,
 			     LyXLayout const & style);
@@ -144,6 +149,7 @@ public:
 
 	///
 	void validate(LaTeXFeatures & features,
+	              LyXFont const & outerfont,
 		      LyXLayout const & layout) const;
 
 	///
Index: src/trans_mgr.C
===================================================================
--- src/trans_mgr.C	(Revision 16797)
+++ src/trans_mgr.C	(Arbeitskopie)
@@ -22,8 +22,6 @@
 #include "lyxtext.h"
 #include "trans.h"
 
-#include "insets/insetlatexaccent.h"
-
 #include "support/lstrings.h"
 
 
@@ -287,14 +285,7 @@ void TransManager::insert(string const &
 	if (chset_.getName() != lyxrc.font_norm ||
 	    !enc.first) {
 		// Could not find an encoding
-		InsetLatexAccent ins(str);
-		if (ins.canDisplay()) {
-			cap::replaceSelection(cur);
-			cur.insert(new InsetLatexAccent(ins));
-			cur.posRight();
-		} else {
-			insertVerbatim(str, text, cur);
-		}
+		insertVerbatim(str, text, cur);
 		return;
 	}
 	string const tmp(1, static_cast<char>(enc.second));
Index: src/lyx_main.C
===================================================================
--- src/lyx_main.C	(Revision 16797)
+++ src/lyx_main.C	(Arbeitskopie)
@@ -871,7 +871,7 @@ bool LyX::init()
 	if (!readRcFile("preferences"))
 		return false;
 
-	if (!readEncodingsFile("encodings"))
+	if (!readEncodingsFile("encodings", "unicodesymbols"))
 		return false;
 	if (!readLanguagesFile("languages"))
 		return false;
@@ -1238,16 +1238,24 @@ bool LyX::readLanguagesFile(string const
 
 
 // Read the encodings file `name'
-bool LyX::readEncodingsFile(string const & name)
+bool LyX::readEncodingsFile(string const & enc_name,
+                            string const & symbols_name)
 {
-	lyxerr[Debug::INIT] << "About to read " << name << "..." << endl;
+	lyxerr[Debug::INIT] << "About to read " << enc_name << " and "
+	                    << symbols_name << "..." << endl;
 
-	FileName const enc_path = libFileSearch(string(), name);
+	FileName const symbols_path = libFileSearch(string(), symbols_name);
+	if (symbols_path.empty()) {
+		showFileError(symbols_name);
+		return false;
+	}
+
+	FileName const enc_path = libFileSearch(string(), enc_name);
 	if (enc_path.empty()) {
-		showFileError(name);
+		showFileError(enc_name);
 		return false;
 	}
-	encodings.read(enc_path);
+	encodings.read(enc_path, symbols_path);
 	return true;
 }
 
Index: src/paragraph.C
===================================================================
--- src/paragraph.C	(Revision 16797)
+++ src/paragraph.C	(Arbeitskopie)
@@ -229,9 +229,10 @@ void Paragraph::write(Buffer const & buf
 }
 
 
-void Paragraph::validate(LaTeXFeatures & features) const
+void Paragraph::validate(LaTeXFeatures & features,
+                         LyXFont const & outerfont) const
 {
-	pimpl_->validate(features, *layout());
+	pimpl_->validate(features, outerfont, *layout());
 }
 
 
@@ -1039,16 +1040,6 @@ bool Paragraph::simpleTeXOnePar(Buffer c
 
 		LyXFont const last_font = running_font;
 
-		// Spaces at end of font change are simulated to be
-		// outside font change, i.e. we write "\textXX{text} "
-		// rather than "\textXX{text }". (Asger)
-		if (open_font && c == ' ' && i <= size() - 2) {
-			LyXFont const & next_font = getFont(bparams, i + 1, outerfont);
-			if (next_font != running_font && next_font != font) {
-				font = next_font;
-			}
-		}
-
 		// We end font definition before blanks
 		if (open_font &&
 		    (font != running_font ||
@@ -1062,15 +1053,6 @@ bool Paragraph::simpleTeXOnePar(Buffer c
 			open_font = false;
 		}
 
-		// Blanks are printed before start of fontswitch
-		if (c == ' ') {
-			// Do not print the separation of the optional argument
-			if (i != body_pos - 1) {
-				pimpl_->simpleTeXBlanks(os, texrow, i,
-						       column, font, *style);
-			}
-		}
-
 		// Do we need to change font?
 		if ((font != running_font ||
 		     font.language() != running_font.language()) &&
@@ -1082,6 +1064,14 @@ bool Paragraph::simpleTeXOnePar(Buffer c
 			open_font = true;
 		}
 
+		if (c == ' ') {
+			// Do not print the separation of the optional argument
+			if (i != body_pos - 1) {
+				pimpl_->simpleTeXBlanks(bparams, os, texrow, i,
+						       column, font, *style);
+			}
+		}
+
 		Change::Type changeType = pimpl_->lookupChange(i).type;
 
 		column += Changes::latexMarkChange(os, runningChangeType,
Index: src/text.C
===================================================================
--- src/text.C	(Revision 16797)
+++ src/text.C	(Arbeitskopie)
@@ -58,7 +58,6 @@
 #include "insets/insettext.h"
 #include "insets/insetbibitem.h"
 #include "insets/insethfill.h"
-#include "insets/insetlatexaccent.h"
 #include "insets/insetline.h"
 #include "insets/insetnewline.h"
 #include "insets/insetpagebreak.h"
@@ -240,10 +239,6 @@ void readParToken(Buffer const & buf, Pa
 			par.insertInset(par.size(), inset.release(),
 					font, change);
 		}
-	} else if (token == "\\i") {
-		auto_ptr<InsetBase> inset(new InsetLatexAccent);
-		inset->read(buf, lex);
-		par.insertInset(par.size(), inset.release(), font, change);
 	} else if (token == "\\backslash") {
 		par.insertChar(par.size(), '\\', font, change);
 	} else if (token == "\\newline") {
Index: src/lyx_main.h
===================================================================
--- src/lyx_main.h	(Revision 16797)
+++ src/lyx_main.h	(Arbeitskopie)
@@ -149,8 +149,11 @@ private:
 	bool readUIFile(std::string const & name, bool include = false);
 	/// read the given languages file
 	bool readLanguagesFile(std::string const & name);
-	/// read the given encodings file
-	bool readEncodingsFile(std::string const & name);
+	/// read the encodings.
+	/// \param enc_name encodings definition file
+	/// \param symbols_name unicode->LaTeX mapping file
+	bool readEncodingsFile(std::string const & enc_name,
+	                       std::string const & symbols_name);
 	/// parsing of non-gui LyX options.
 	void easyParse(int & argc, char * argv[]);
 	/// shows up a parsing error on screen
Index: src/encoding.C
===================================================================
--- src/encoding.C	(Revision 16797)
+++ src/encoding.C	(Arbeitskopie)
@@ -19,10 +19,14 @@
 #include "lyxrc.h"
 
 #include "support/filename.h"
+#include "support/lstrings.h"
+#include "support/unicode.h"
 
 
 namespace lyx {
 
+using support::FileName;
+
 #ifndef CXX_GLOBAL_CSTD
 using std::strtol;
 #endif
@@ -177,9 +181,63 @@ char_type arabic_table[63][2] = {
 
 char_type const arabic_start = 0xc1;
 
+
+struct CharInfo {
+	docstring command;
+	string preamble;
+	bool combining;
+};
+
+
+typedef std::map<char_type, CharInfo> CharInfoMap;
+CharInfoMap unicodesymbols;
+
 } // namespace anon
 
 
+Encoding::Encoding(string const & n, string const & l, string const & i)
+	: Name_(n), LatexName_(l), iconvName_(i)
+{
+	if (n != "utf8") {
+		for (unsigned short j = 0; j < 256; ++j) {
+			char const c = j;
+			std::vector<char_type> const ucs4 = eightbit_to_ucs4(&c, 1, i);
+			if (ucs4.size() == 1)
+				ucs4_to_self_[ucs4[0]] = c;
+		}
+	}
+}
+
+
+docstring const Encoding::latexChar(char_type c) const
+{
+	TransTable::const_iterator const it1 = ucs4_to_self_.find(c);
+	if (it1 == ucs4_to_self_.end()) {
+		// c cannot be encoded in this encoding
+		CharInfoMap::const_iterator const it2 = unicodesymbols.find(c);
+		if (it2 == unicodesymbols.end())
+			lyxerr << "Could not find LaTeX command for character 0x"
+			       << std::hex << c << ".\nLaTeX export will fail."
+			       << endl;
+		else
+			return it2->second.command;
+	}
+	return docstring(1, c);
+}
+
+
+string const Encoding::preamble(char_type c) const
+{
+	TransTable::const_iterator const it1 = ucs4_to_self_.find(c);
+	if (it1 == ucs4_to_self_.end()) {
+		// c cannot be encoded in this encoding
+		CharInfoMap::const_iterator const it2 = unicodesymbols.find(c);
+		if (it2 != unicodesymbols.end())
+			return it2->second.preamble;
+	}
+	return string();
+}
+
 
 bool Encodings::isComposeChar_hebrew(char_type c)
 {
@@ -226,6 +284,15 @@ char_type Encodings::transformChar(char_
 }
 
 
+bool Encodings::isCombiningChar(char_type c)
+{
+	CharInfoMap::const_iterator const it = unicodesymbols.find(c);
+	if (it != unicodesymbols.end())
+		return it->second.combining;
+	return false;
+}
+
+
 Encoding const * Encodings::getFromLyXName(string const & name) const
 {
 	EncodingList::const_iterator it = encodinglist.find(name);
@@ -269,8 +336,60 @@ Encodings::Encodings()
 {
 }
 
-void Encodings::read(support::FileName const & filename)
+
+void Encodings::read(FileName const & encfile, FileName const & symbolsfile)
 {
+	// We must read the symbolsfile first, because the Encoding
+	// constructor depends on it.
+	LyXLex symbolslex(0, 0);
+	symbolslex.setFile(symbolsfile);
+	while (symbolslex.isOK()) {
+		char_type symbol;
+		CharInfo info;
+		string flags;
+
+		if (symbolslex.next(true)) {
+			std::istringstream is(symbolslex.getString());
+			// reading symbol directly does not work if
+			// char_type == std::wchar_t.
+			boost::uint32_t tmp;
+			if(!(is >> std::hex >> tmp))
+				break;
+			symbol = tmp;
+		} else
+			break;
+		if (symbolslex.next(true))
+			info.command = symbolslex.getDocString();
+		else
+			break;
+		if (symbolslex.next(true))
+			info.preamble = symbolslex.getString();
+		else
+			break;
+		if (symbolslex.next(true))
+			flags = symbolslex.getString();
+		else
+			break;
+
+		info.combining = false;
+		while (!flags.empty()) {
+			string flag;
+			flags = support::split(flags, flag, ',');
+			if (flag == "combining")
+				info.combining = true;
+			else
+				lyxerr << "Ignoring unknown flag `" << flag
+				       << "' for symbol `0x" << std::hex
+				       << symbol << "'." << endl;
+		}
+
+		lyxerr << "Read unicode symbol " << symbol << " '"
+		       << to_utf8(info.command) << "' '" << info.preamble
+		       << "' " << info.combining << endl;
+		unicodesymbols[symbol] = info;
+	}
+
+	// Now read the encodings
 	enum Encodingtags {
 		et_encoding = 1,
 		et_end,
@@ -283,7 +402,7 @@ void Encodings::read(support::FileName c
 	};
 
 	LyXLex lex(encodingtags, et_last - 1);
-	lex.setFile(filename);
+	lex.setFile(encfile);
 	while (lex.isOK()) {
 		switch (lex.lex()) {
 		case et_encoding:
Index: lib/lyx2lyx/LyX.py
===================================================================
--- lib/lyx2lyx/LyX.py	(Revision 16797)
+++ lib/lyx2lyx/LyX.py	(Arbeitskopie)
@@ -73,7 +73,7 @@ format_relation = [("0_06",    [200], ge
                    ("1_2",     [220], generate_minor_versions("1.2" , 4)),
                    ("1_3",     [221], generate_minor_versions("1.3" , 7)),
                    ("1_4", range(222,246), generate_minor_versions("1.4" , 3)),
-                   ("1_5", range(246,257), generate_minor_versions("1.5" , 0))]
+                   ("1_5", range(246,258), generate_minor_versions("1.5" , 0))]
 
 
 def formats_list():
Index: lib/lyx2lyx/lyx_1_5.py
===================================================================
--- lib/lyx2lyx/lyx_1_5.py	(Revision 16797)
+++ lib/lyx2lyx/lyx_1_5.py	(Arbeitskopie)
@@ -20,7 +20,9 @@
 """ Convert files to the file format generated by lyx 1.5"""
 
 import re
-from parser_tools import find_token, find_token_exact, find_tokens, find_end_of, get_value
+import unicodedata
+
+from parser_tools import find_re, find_token, find_token_exact, find_tokens, find_end_of, get_value
 from LyX import get_encoding
 
 
@@ -720,6 +722,251 @@ def revert_encodings(document):
     document.inputencoding = get_value(document.header, "\\inputencoding", 0)
 
 
+# Accents of InsetLaTeXAccent
+accent_map = {
+    "`" : u'\u0309', # grave
+    "'" : u'\u0301', # acute
+    "^" : u'\u0302', # circumflex
+    "~" : u'\u0303', # tilde
+    "=" : u'\u0304', # macron
+    "u" : u'\u0306', # breve
+    "." : u'\u0307', # dot above
+    "\"": u'\u0308', # diaresis
+    "r" : u'\u030a', # ring above
+    "H" : u'\u030b', # double acute
+    "v" : u'\u030c', # caron
+    "b" : u'\u0320', # minus sign below
+    "d" : u'\u0323', # dot below
+    "c" : u'\u0327', # cedilla
+    "k" : u'\u0328', # ogonek
+    "t" : u'\u0361'  # tie. This is special: It spans two characters, but
+                     # only one is given as argument, so we don't need to
+                     # treat it differently.
+}
+
+
+# special accents of InsetLaTeXAccent without argument
+special_accent_map = {
+    'i' : u'\u0131', # dotless i
+    'j' : u'\u0237', # dotless j
+    'l' : u'\u0142', # l with stroke
+    'L' : u'\u0141'  # L with stroke
+}
+
+
+# special accent arguments of InsetLaTeXAccent
+accented_map = {
+    '\\i' : u'\u0131', # dotless i
+    '\\j' : u'\u0237'  # dotless j
+}
+
+
+def _convert_accent(accent, accented_char):
+    type = accent
+    char = accented_char
+    if char == '':
+        if type in special_accent_map:
+            return special_accent_map[type]
+        # a missing char is treated as space by LyX
+        char = ' '
+    elif type == 'q' and char in ['t', 'd', 'l', 'L']:
+        # Special caron, only used with t, d, l and L.
+        # It is not in the map because we convert it to the same unicode
+        # character as the normal caron: \q{} is only defined if babel with
+        # the czech or slovak language is used, and the normal caron
+        # produces the correct output if the T1 font encoding is used.
+        # For the same reason we never convert to \q{} in the other direction.
+        type = 'v'
+    elif char in accented_map:
+        char = accented_map[char]
+    elif (len(char) > 1):
+        # We can only convert accents on a single char
+        return ''
+    a = accent_map.get(type)
+    if a:
+        return unicodedata.normalize("NFKC", "%s%s" % (char, a))
+    return ''
+
+
+def convert_ertbackslash(body, i, ert, default_layout):
+    r""" -------------------------------------------------------------------------------------------
+    Convert backslashes and '\n' into valid ERT code, append the converted
+    text to body[i] and return the (maybe incremented) line index i"""
+
+    for c in ert:
+        if c == '\\':
+            body[i] = body[i] + '\\backslash '
+            i = i + 1
+            body.insert(i, '')
+        elif c == '\n':
+            body[i+1:i+1] = ['\\end_layout', '', '\\begin_layout %s' % default_layout, '']
+            i = i + 4
+        else:
+            body[i] = body[i] + c
+    return i
+
+
+def convert_accent(document):
+    # The following forms are supported by LyX:
+    # '\i \"{a}' (standard form, as written by LyX)
+    # '\i \"{}' (standard form, as written by LyX if the accented char is a space)
+    # '\i \"{ }' (also accepted if the accented char is a space)
+    # '\i \" a'  (also accepted)
+    # '\i \"'    (also accepted)
+    re_wholeinset = re.compile(r'^(.*)(\\i\s+)(.*)$')
+    re_contents = re.compile(r'^([^\s{]+)(.*)$')
+    re_accentedcontents = re.compile(r'^\s*{?([^{}]*)}?\s*$')
+    i = 0
+    while 1:
+        i = find_re(document.body, re_wholeinset, i)
+        if i == -1:
+            return
+        match = re_wholeinset.match(document.body[i])
+        prefix = match.group(1)
+        contents = match.group(3).strip()
+        match = re_contents.match(contents)
+        if match:
+            # Strip first char (always \)
+            accent = match.group(1)[1:]
+            accented_contents = match.group(2).strip()
+            match = re_accentedcontents.match(accented_contents)
+            accented_char = match.group(1)
+            converted = _convert_accent(accent, accented_char)
+            if converted == '':
+                # Normalize contents
+                contents = '%s{%s}' % (accent, accented_char)
+            else:
+                document.body[i] = '%s%s' % (prefix, converted)
+                i += 1
+                continue
+        document.warning("Converting unknown InsetLaTeXAccent `\\i %s' to ERT." % contents)
+        document.body[i] = prefix
+        document.body[i+1:i+1] = ['\\begin_inset ERT',
+                                  'status collapsed',
+                                  '',
+                                  '\\begin_layout %s' % document.default_layout,
+                                  '',
+                                  '',
+                                  '']
+        i = convert_ertbackslash(document.body, i + 7,
+                                 '\\%s' % contents,
+                                 document.default_layout)
+        document.body[i+1:i+1] = ['\\end_layout',
+                                  '',
+                                  '\\end_inset']
+        i += 3
+
+
+def revert_accent(document):
+    inverse_accent_map = {}
+    for k in accent_map:
+        inverse_accent_map[accent_map[k]] = k
+    inverse_special_accent_map = {}
+    for k in special_accent_map:
+        inverse_special_accent_map[special_accent_map[k]] = k
+    inverse_accented_map = {}
+    for k in accented_map:
+        inverse_accented_map[accented_map[k]] = k
+
+    # Since LyX may insert a line break within a word we must combine all
+    # words before unicode normalization.
+    # We do this only if the next line starts with an accent, otherwise we
+    # would create things like '\begin_inset ERTstatus'.
+    numberoflines = len(document.body)
+    for i in range(numberoflines-1):
+        if document.body[i] == '' or document.body[i+1] == '' or document.body[i][-1] == ' ':
+            continue
+        if (document.body[i+1][0] in inverse_accent_map):
+            # the last character of this line and the first of the next line
+            # probably form a decomposed base + combining accent pair.
+            while (len(document.body[i+1]) > 0 and document.body[i+1][0] != ' '):
+                document.body[i] += document.body[i+1][0]
+                document.body[i+1] = document.body[i+1][1:]
+
+    # Normalize to "Normal form D" (NFD, also known as canonical decomposition).
+    # This is needed to catch all accented characters.
+    for i in range(numberoflines):
+        # Unfortunately we have a mixture of unicode strings and plain strings,
+        # because we never use u'xxx' for string literals, but 'xxx'.
+        # Therefore we may have to try two times to normalize the data.
+        try:
+            document.body[i] = unicodedata.normalize("NFD", document.body[i])
+        except TypeError:
+            document.body[i] = unicodedata.normalize("NFD", unicode(document.body[i], 'utf-8'))
+
+    # Replace accented characters with InsetLaTeXAccent
+    # Do not convert characters that can be represented in the chosen
+    # encoding.
+    encoding_stack = [get_encoding(document.language, document.inputencoding, 248)]
+    lang_re = re.compile(r"^\\lang\s(\S+)")
+    for i in range(len(document.body)):
+
+        if document.inputencoding == "auto" or document.inputencoding == "default":
+            # Track the encoding of the current line
+            result = lang_re.match(document.body[i])
+            if result:
+                language = result.group(1)
+                if language == "default":
+                    encoding_stack[-1] = document.encoding
+                else:
+                    from lyx2lyx_lang import lang
+                    encoding_stack[-1] = lang[language][3]
+                continue
+            elif find_token(document.body, "\\begin_layout", i, i + 1) == i:
+                encoding_stack.append(encoding_stack[-1])
+                continue
+            elif find_token(document.body, "\\end_layout", i, i + 1) == i:
+                del encoding_stack[-1]
+                continue
+
+        for j in range(len(document.body[i])):
+            # dotless i and dotless j are both in special_accent_map and can
+            # occur as an accented character, so we need to test that the
+            # following character is not an accent
+            if (document.body[i][j] in inverse_special_accent_map and
+                (j == len(document.body[i]) - 1 or document.body[i][j+1] not in inverse_accent_map)):
+                accent = document.body[i][j]
+                try:
+                    dummy = accent.encode(encoding_stack[-1])
+                except UnicodeEncodeError:
+                    # Insert the rest of the line as a new line
+                    if j < len(document.body[i]) - 1:
+                        document.body[i+1:i+1] = [document.body[i][j+1:]]
+                    # Delete the accented character
+                    if j > 0:
+                        document.body[i] = document.body[i][:j]
+                    else:
+                        document.body[i] = u''
+                    # Finally add the InsetLaTeXAccent
+                    document.body[i] += "\\i \\%s{}" % inverse_special_accent_map[accent]
+                    break
+            elif j > 0 and document.body[i][j] in inverse_accent_map:
+                accented_char = document.body[i][j-1]
+                if accented_char == ' ':
+                    # Conform to LyX output
+                    accented_char = ''
+                elif accented_char in inverse_accented_map:
+                    accented_char = inverse_accented_map[accented_char]
+                accent = document.body[i][j]
+                try:
+                    dummy = unicodedata.normalize("NFC", accented_char + accent).encode(encoding_stack[-1])
+                except UnicodeEncodeError:
+                    # Insert the rest of the line as a new line
+                    if j < len(document.body[i]) - 1:
+                        document.body[i+1:i+1] = [document.body[i][j+1:]]
+                    # Delete the accented characters
+                    if j > 1:
+                        document.body[i] = document.body[i][:j-1]
+                    else:
+                        document.body[i] = u''
+                    # Finally add the InsetLaTeXAccent
+                    document.body[i] += "\\i \\%s{%s}" % (inverse_accent_map[accent], accented_char)
+                    break
+    # Normalize to "Normal form C" (NFC, pre-composed characters) again
+    for i in range(numberoflines):
+        document.body[i] = unicodedata.normalize("NFC", document.body[i])
+
+
 ##
 # Conversion hub
 #
@@ -735,16 +982,18 @@ convert = [[246, []],
            [253, []],
            [254, [convert_esint]],
            [255, []],
-           [256, []]]
+           [256, []],
+           [257, [convert_accent]]]
 
-revert =  [[255, [revert_encodings]],
+revert =  [[256, []],
+           [255, [revert_encodings]],
            [254, [revert_clearpage, revert_cleardoublepage]],
            [253, [revert_esint]],
            [252, [revert_nomenclature, revert_printnomenclature]],
            [251, [revert_commandparams]],
            [250, [revert_cs_label]],
            [249, []],
-           [248, [revert_utf8]],
+           [248, [revert_accent, revert_utf8]],
            [247, [revert_booktabs]],
            [246, [revert_font_settings]],
            [245, [revert_framed]]]
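
For reviewers: the revert path above leans on Unicode normalization to peel accents off precomposed characters before matching them against the inverse accent maps. Here is a minimal standalone sketch of that round trip (plain Python, independent of lyx2lyx):

```python
import unicodedata

# NFD splits a precomposed character into base letter + combining mark,
# which is the form revert_accent scans for.
decomposed = unicodedata.normalize("NFD", u"\u00e4")   # ä
assert decomposed == u"a\u0308"                        # 'a' + COMBINING DIAERESIS

# The inverse lookup then maps the combining mark back to a LyX accent
# command (e.g. U+0308 -> '"'); NFC recomposes whatever was left alone.
assert unicodedata.normalize("NFC", decomposed) == u"\u00e4"
```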
Index: lib/unicodesymbols
===================================================================
--- lib/unicodesymbols	(Revision 0)
+++ lib/unicodesymbols	(Revision 0)
@@ -0,0 +1,69 @@
+#
+# file unicodesymbols
+# This file is part of LyX, the document processor.
+# Licence details can be found in the file COPYING.
+#
+# author Georg Baum
+#
+# Full author contact details are available in file CREDITS.
+
+# This file is a database of LaTeX commands for unicode characters.
+# These commands will be used by LyX for LaTeX export for all characters
+# that are not representable in the chosen encoding.
+
+# syntax:
+# ucs4 command                   preamble                flags
+0x00a3 "\\pounds{}"               "\\usepackage{textcomp}" "" # £ POUND SIGN
+0x00a4 "\\textcurrency{}"         "\\usepackage{textcomp}" "" # CURRENCY SYMBOL
+0x00a6 "\\textbrokenbar{}"        "\\usepackage{textcomp}" "" # BROKEN BAR
+0x00a8 "\\textasciidieresis{}"    "\\usepackage{textcomp}" "" # DIAERESIS
+0x00ac "\\textlnot{}"             "\\usepackage{textcomp}" "" # ¬ NOT SIGN
+0x00b1 "\\textpm{}"               "\\usepackage{textcomp}" "" # ± PLUS-MINUS SIGN
+0x00b2 "\\textsuperscript{2}"     "\\usepackage{textcomp}" "" # ² SUPERSCRIPT TWO
+0x00b3 "\\textsuperscript{3}"     "\\usepackage{textcomp}" "" # ³ SUPERSCRIPT THREE
+0x00b4 "\\textasciiacute{}"       "\\usepackage{textcomp}" "" # ACUTE ACCENT
+0x00b5 "\\textmu{}"               "\\usepackage{textcomp}" "" # µ MICRO SIGN
+0x00b8 "\\c\\ "                   "" "" # CEDILLA (command from latin1.def)
+0x00b9 "\\textsuperscript{1}"     "\\usepackage{textcomp}" "" # ¹ SUPERSCRIPT ONE
+0x00bc "\\textonequarter{}"       "\\usepackage{textcomp}" "" # ¼ VULGAR FRACTION ONE QUARTER
+0x00bd "\\textonehalf{}"          "\\usepackage{textcomp}" "" # ½ VULGAR FRACTION ONE HALF
+0x00be "\\textthreequarters{}"    "\\usepackage{textcomp}" "" # ¾ VULGAR FRACTION THREE QUARTERS
+0x00e1 "\\'{a}"                   "" "" # LATIN SMALL LETTER A WITH ACUTE
+0x00e2 "\\^{a}"                   "" "" # LATIN SMALL LETTER A WITH CIRCUMFLEX
+0x00e3 "\\~{a}"                   "" "" # LATIN SMALL LETTER A WITH TILDE
+0x00e4 "\\\"{a}"                  "" "" # LATIN SMALL LETTER A WITH DIAERESIS
+0x00e5 "\\r{a}"                   "" "" # LATIN SMALL LETTER A WITH RING ABOVE
+0x00d7 "\\texttimes{}"            "\\usepackage{textcomp}" "" # × MULTIPLICATION SIGN
+0x00f7 "\\textdiv{}"              "\\usepackage{textcomp}" "" # ÷ DIVISION SIGN
+0x0101 "\\={a}"                   "" "" # LATIN SMALL LETTER A WITH MACRON
+0x0103 "\\u{a}"                   "" "" # LATIN SMALL LETTER A WITH BREVE
+0x010d "\\v{c}"                   "" "" # LATIN SMALL LETTER C WITH CARON
+0x0131 "\\i{}"                    "" "" # dotless i
+0x013d "\\v{L}"                   "" "" # LATIN CAPITAL LETTER L WITH CARON
+0x0141 "\\L{}"                    "" "" # L with stroke
+0x0142 "\\l{}"                    "" "" # l with stroke
+0x01ce "\\v{a}"                   "" "" # LATIN SMALL LETTER A WITH CARON
+0x01d0 "\\v{\\i}"                 "" "" # LATIN SMALL LETTER I WITH CARON
+0x01f0 "\\v{\\j}"                 "" "" # LATIN SMALL LETTER J WITH CARON
+0x0227 "\\.{a}"                   "" "" # LATIN SMALL LETTER A WITH DOT ABOVE
+0x0237 "\\j{}"                    "" "" # dotless j
+0x0301 "\\'"                      "" "combining" # acute
+0x0302 "\\^"                      "" "combining" # circumflex
+0x0303 "\\~"                      "" "combining" # tilde
+0x0304 "\\="                      "" "combining" # macron
+0x0306 "\\u"                      "" "combining" # breve
+0x0307 "\\."                      "" "combining" # dot above
+0x0308 "\\\""                     "" "combining" # diaeresis
+0x0309 "\\`"                      "" "combining" # hook above
+0x030a "\\r"                      "" "combining" # ring above
+0x030b "\\H"                      "" "combining" # double acute
+0x030c "\\v"                      "" "combining" # caron
+0x0320 "\\b"                      "" "combining" # minus sign below
+0x0323 "\\d"                      "" "combining" # dot below
+0x0327 "\\c"                      "" "combining" # cedilla
+0x0328 "\\k"                      "" "combining" # ogonek
+0x0361 "\\t"                      "" "combining" # tie
+0x1ea1 "\\d{a}"                   "" "" # LATIN SMALL LETTER A WITH DOT BELOW
+0x1ea3 "\\`{a}"                   "" "" # LATIN SMALL LETTER A WITH HOOK ABOVE
+0x20ac "\\texteuro{}"             "" "" # euro sign
+

Property changes on: lib/unicodesymbols
___________________________________________________________________
Name: svn:eol-style
   + native

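The entry syntax above (hex code point, quoted command, quoted preamble, quoted flags, optional # comment) is simple enough to read with shell-style tokenization. A hypothetical Python sketch of a reader (parse_line is my name; LyX itself will of course use its own lexer):

```python
import shlex

def parse_line(line):
    # Hypothetical reader for the unicodesymbols entry format shown
    # above; only an illustration, not LyX code.
    tokens = shlex.split(line, comments=True)
    if not tokens:
        return None  # blank line or pure comment
    codepoint, command, preamble, flags = tokens
    return int(codepoint, 16), command, preamble, flags

# The POUND SIGN entry from the file above ('\\' collapses to '\'
# under shell-style quoting inside double quotes):
entry = parse_line(r'0x00a3 "\\pounds{}" "\\usepackage{textcomp}" "" # POUND SIGN')
assert entry == (0x00a3, '\\pounds{}', '\\usepackage{textcomp}', '')
```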
Index: lib/Makefile.am
===================================================================
--- lib/Makefile.am	(Revision 16797)
+++ lib/Makefile.am	(working copy)
@@ -5,7 +5,7 @@ SUBDIRS = doc lyx2lyx
 CHMOD = chmod
 
 dist_pkgdata_DATA = CREDITS chkconfig.ltx \
-	       external_templates encodings languages symbols syntax.default
+	       external_templates encodings languages symbols syntax.default unicodesymbols
 
 # Note that we "chmod 755" manually this file in install-data-hook.
 dist_pkgdata_PYTHON = configure.py 
Index: development/scons/scons_manifest.py
===================================================================
--- development/scons/scons_manifest.py	(Revision 16797)
+++ development/scons/scons_manifest.py	(working copy)
@@ -346,7 +346,6 @@ src_insets_header_files = Split('''
     insetinclude.h
     insetindex.h
     insetlabel.h
-    insetlatexaccent.h
     insetline.h
     insetmarginal.h
     insetnewline.h
@@ -402,7 +401,6 @@ src_insets_files = Split('''
     insetinclude.C
     insetindex.C
     insetlabel.C
-    insetlatexaccent.C
     insetline.C
     insetmarginal.C
     insetnewline.C
@@ -1274,6 +1272,7 @@ lib_files = Split('''
     languages
     symbols
     syntax.default
+    unicodesymbols
     configure.py
 ''')
 

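One more note on the encoding test in revert_accent: a character stays unconverted only if the target encoding can represent it, and Python's codecs can check that directly via the UnicodeEncodeError that .encode() raises. Minimal illustration (not part of the patch):

```python
# A character is representable iff .encode() succeeds -- the same
# UnicodeEncodeError test revert_accent applies per encoding_stack entry.
def representable(char, encoding):
    try:
        char.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

assert representable(u"\u00e4", "latin1")       # ä is in latin1
assert not representable(u"\u20ac", "latin1")   # the euro sign is not
assert representable(u"\u20ac", "iso-8859-15")  # but it is in latin9
```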