Re: [mkgmap-dev] TYP files and character encoding

Ticker Berkin Tue, 14 Jan 2020 01:43:45 -0800

Hi Gerd

Here is an updated patch that closes the file. I find many files in
mkgmap that are never explicitly close()d, but I presume .finalize()
will close them eventually.
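
For reference, a minimal sketch of closing a reader deterministically with
try-with-resources rather than leaving it to finalization, whose timing is
unpredictable. The class and method names here (FirstLine, firstLine) are
hypothetical and not part of mkgmap. Closing the outermost reader also
closes whatever it wraps.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;

public class FirstLine {
    // Hypothetical helper: returns the first line of the source and always
    // closes it, whether readLine() succeeds or throws.
    static String firstLine(Reader source) {
        try (BufferedReader br = new BufferedReader(source)) {
            return br.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The same shape would apply to a reader wrapped around a FileInputStream.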


I'll do another patch for the other text-file handling, using
StandardCharsets where possible, fixing the TokenScanner message for bad
characters when the file is not utf-8 and, if reasonable, allowing a BOM
even if the file is opened as utf-8 anyway.
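
As a rough idea of what allowing a BOM could look like: Java's UTF-8 decoder
does not strip a leading BOM, so it arrives as the character U+FEFF. The
sketch below peeks at the first character and only consumes it if it is
the BOM. The names BomSkip, openUtf8 and firstLine are made up for this
example and are not existing mkgmap code.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

public class BomSkip {
    // Open the stream as UTF-8 and drop a leading U+FEFF if there is one.
    static BufferedReader openUtf8(InputStream in) throws IOException {
        BufferedReader br = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8));
        br.mark(1);               // remember the start of the text
        if (br.read() != 0xFEFF)  // not a BOM (or empty input):
            br.reset();           // go back so nothing is lost
        return br;
    }

    // Small wrapper for trying it out on in-memory bytes.
    static String firstLine(byte[] data) {
        try (BufferedReader br = openUtf8(new ByteArrayInputStream(data))) {
            return br.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```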

Ticker

On Tue, 2020-01-14 at 08:21 +0000, Gerd Petermann wrote:
> Hi Ticker,
> 
> thanks for the patch.
> 
> Please review TypCompiler.CharsetProbe.  BufferedReader br is not
> closed. Is that intended?
> 
> I see that we have a mix of "utf-8" and "UTF-8" in the mkgmap
> sources. I think it would be good to use StandardCharsets.UTF_8 where
> possible and unify the rest.
> 
> Gerd
> 
> ________________________________________
> From: mkgmap-dev <mkgmap-dev-boun...@lists.mkgmap.org.uk> on behalf
> of Ticker Berkin <rwb-mkg...@jagit.co.uk>
> Sent: Monday, 13 January 2020 11:34
> To: Development list for mkgmap
> Subject: Re: [mkgmap-dev] TYP files and character encoding
> 
> Hi Gerd
> 
> I've updated this patch with changes to TypCompiler CharsetProbe:
> 
> 1/ looks for unicode BOM in various encodings near start of file.
> 2/ looks for line containing "-*- coding: charset -*-" near start of
> the file.
> 3/ retains the check for "CodePage=" coding for compatibility.
> 4/ in the absence of the above, sets the reading charset to utf-8 if
> the file is valid utf-8, otherwise to Cp1252.
> 5/ fixes the bad character message from the scanner to say what the
> charset really is rather than saying "utf-8" regardless.
> 6/ removes the logic that checks if String... lines, read in the
> charset it is currently trying, can be encoded in the presumed output
> CodePage.
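
A simplified sketch of the BOM check (1/) and the utf-8 fallback (4/) from
the list above. It works on the raw bytes at the start of the file rather
than line by line as the patch does, and it leaves out the "-*- coding:"
and "CodePage=" checks, so it is not the patch's code. CharsetGuess and its
methods are hypothetical names.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class CharsetGuess {
    // True if data begins with the given unsigned byte values.
    static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length)
            return false;
        for (int i = 0; i < prefix.length; i++)
            if ((data[i] & 0xFF) != prefix[i])
                return false;
        return true;
    }

    static String guess(byte[] data) {
        if (startsWith(data, 0xEF, 0xBB, 0xBF)) return "UTF-8";
        // UTF-32 must be tested before UTF-16: FF FE is a prefix of FF FE 00 00
        if (startsWith(data, 0xFF, 0xFE, 0x00, 0x00)) return "UTF-32LE";
        if (startsWith(data, 0x00, 0x00, 0xFE, 0xFF)) return "UTF-32BE";
        if (startsWith(data, 0xFF, 0xFE)) return "UTF-16LE";
        if (startsWith(data, 0xFE, 0xFF)) return "UTF-16BE";

        // No BOM: accept UTF-8 only if every byte sequence decodes cleanly.
        CharsetDecoder strict = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            strict.decode(ByteBuffer.wrap(data));
            return "UTF-8";
        } catch (CharacterCodingException e) {
            return "cp1252"; // fallback named in step 4/
        }
    }
}
```

Pure ASCII input passes the strict decode, so it is reported as UTF-8.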
> 
> The final result of this patch should be that:
> 
> a/ No existing usage is broken
> b/ 2 methods to indicate the charset/encoding of the file that are
> commonly used by text editors can be used and are taken notice of.
> Previously, just the UTF-8 BOM was detected.
> c/ Typ files can, and should from now on, be written in utf-8
> d/ labels for languages not supported in the --code-page of the output
> img just generate a warning in mkgmap.log.x
> 
> Ticker
> 
> 
> On Sat, 2019-12-21 at 16:11 +0000, Ticker Berkin wrote:
> > Hi Gerd
> > 
> > Attached is a patch that:
> > 
> > > Doesn't use the 'CodePage=' command in the typ-file to determine the
> > > output character encoding of the typ-file; rather, it uses the main
> > > map encoding from the --code-page argument.
> > > 
> > > log.warn's any typ labels that can't be encoded in the --code-page,
> > > rather than just giving up with a message like:
> > > > TYP file cannot be written in code page 1252
> > > 
> > > The message:
> > > > WARNING: SortCode in TYP txt file different from command line
> > > > setting
> > > that was written directly to System.out is changed to a log.warn, and
> > > it shouldn't happen anyway now.
> > > 
> > > For the moment, the 'CodePage=' command in the typ-file is, under
> > > some circumstances, used to determine the encoding of the typ-file
> > > itself, and I've left this alone for compatibility with existing
> > > usage. Sometime in January I'll provide a better method for this.
> > Ticker
> > 
> > 
> > On Wed, 2019-12-18 at 19:54 +0000, Ticker Berkin wrote:
> > > Hi Gerd
> > > 
> > > > I think it is best to continue with the ideas for typ-files that:
> > > > 
> > > > 1/ they can be in any character set and we just need a better way
> > > > of working out the correct one - see my posting earlier today.
> > > > 
> > > > 2/ it can include as many languages as anyone can be bothered to
> > > > add, and so has to be in a character set that allows the languages
> > > > to be added, implying unicode for a common one (more particularly,
> > > > UTF-8)
> > > > 
> > > > 3/ the codepage= statement should be redundant and ignored for
> > > > controlling the output character set, which should be taken from
> > > > the map, but its use for determining the input coding might need
> > > > to be kept for a while for compatibility.
> > > > 
> > > > 4/ the messages my hack generates should be turned into 1 warning
> > > > or information message per language, or maybe suppressed
> > > > altogether. If someone is generating a map with a character set
> > > > that doesn't support a particular language, they really won't care
> > > > that the data for other languages that have an incompatible
> > > > representation with their language won't be there.
> > > 
> > > Ticker
> > > 
> > > On Wed, 2019-12-18 at 19:08 +0000, Gerd Petermann wrote:
> > > > Hi Ticker,
> > > > 
> > > > > I think I understand now why we didn't have a default typ file ;)
> > > > > If I got that right, I should revert the changes in r4395, and
> > > > > mkgmap should either refuse or warn loudly when a typ file with a
> > > > > different codepage is merged?
> > > > > Or should we force the usage of the unicode codepage?
> > > > > Or is it possible to compile mapnik.txt with cp 1252 (or any
> > > > > other) in a way that only those lines which contain non-matching
> > > > > characters are ignored?
> > > > 
> > > > Gerd
> > > > 
> > > > 
> > > > ________________________________________
> > > > > From: mkgmap-dev <mkgmap-dev-boun...@lists.mkgmap.org.uk> on
> > > > > behalf of Ticker Berkin <rwb-mkg...@jagit.co.uk>
> > > > > Sent: Wednesday, 18 December 2019 19:46
> > > > > To: mkgmap development
> > > > > Subject: [mkgmap-dev] TYP files and character encoding
> > > > 
> > > > Hi
> > > > 
> > > > A couple of problems with typ-files and unicode.
> > > > 
> > > > > With 'Codepage=65001' the final contents of the labels in the
> > > > > mapnik.typ that is included with the composite map are unicode,
> > > > > but if the map is codepage 1252, the unicode characters with the
> > > > > top bit set are simply displayed as if in 1252.
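
To illustrate the effect described above, here is a small sketch that
encodes a label as UTF-8 and then decodes the same bytes as windows-1252,
which is roughly what a device does when it assumes the wrong code page.
MojibakeDemo and misread are hypothetical names, and the example assumes
the windows-1252 charset is available in the JVM.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {
    // Encode as UTF-8, then (wrongly) decode those bytes as windows-1252.
    static String misread(String label) {
        byte[] utf8 = label.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, Charset.forName("windows-1252"));
    }

    public static void main(String[] args) {
        // e-acute is the two bytes C3 A9 in UTF-8; read as 1252 they
        // become two separate characters, so 4 characters turn into 5.
        String shown = misread("Caf\u00E9");
        System.out.println(shown.length());
    }
}
```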
> > > > 
> > > > > Removing the codepage statement from mapnik.txt, making fixes
> > > > > elsewhere to ensure that the file is read correctly as utf-8, and
> > > > > then generating a map with --code-page=1252 gives the error:
> > > > 
> > > > SEVE: uk.me.parabola.imgfmt.MapFailedException
> > > >  ../svn/trunk/resources/typ-files/mapnik.txt:
> > > >  (thrown in TypCompiler.makeMap())
> > > >  TYP file cannot be written in code page 1252
> > > > 
> > > > > Changing the exception handling in
> > > > > imgfmt/app/typ/TypElement.java, so that makeLabelBlock() reads as
> > > > ...
> > > >     CharBuffer cb = CharBuffer.wrap(tl.getText());
> > > >     try {
> > > >         ByteBuffer buffer = encoder.encode(cb);
> > > >         out.put((byte) tl.getLang());
> > > >         out.put(buffer);
> > > >         out.put((byte) 0);
> > > >      }  catch (CharacterCodingException ignore) {
> > > > //        ignore.printStackTrace();
> > > >         String name = encoder.charset().name();
> > > >         System.out.println("Cannot represent String=" +
> > > >             tl.getLang() + "," + tl.getText() +
> > > >             " in CodePage=" + name);
> > > > //        throw newTypLabelException(name);
> > > >      }
> > > > ...
> > > > 
> > > > It gives output like:
> > > > > Cannot represent String=21,Gara|e in CodePage=windows-1252
> > > > > Cannot represent String=21,Obszar przemysBowy in CodePage=windows-1252
> > > > > Cannot represent String=21,ZieleD in CodePage=windows-1252
> > > > > Cannot represent String=21,Zaro[la in CodePage=windows-1252
> > > > > Cannot represent String=21,MokradBa in CodePage=windows-1252
> > > > > Cannot represent String=21,Droga wojew\363dzka (B^Ecznik) in CodePage=windows-1252
> > > > > Cannot represent String=21,Droga szybkiego ruchu  (B^Ecznik) in CodePage=windows-1252
> > > > > Cannot represent String=21,Droga szybkiego ruchu  (B^Ecznik) in CodePage=windows-1252
> > > > > Cannot represent String=21,Zcie|ka rowerowa in CodePage=windows-1252
> > > > > Cannot represent String=21,Wybrze|e in CodePage=windows-1252
> > > > > Cannot represent String=21,Zcie|ka in CodePage=windows-1252
> > > > > Cannot represent String=21,StrumieD in CodePage=windows-1252
> > > > > Cannot represent String=21,Granica paDstwa in CodePage=windows-1252
> > > > > Cannot represent String=21,Rzeka, KanaB in CodePage=windows-1252
> > > > > Cannot represent String=21,StrumieD in CodePage=windows-1252
> > > > > Cannot represent String=21,Ruroci^Eg in CodePage=windows-1252
> > > > > Cannot represent String=21,Kabel wysokiego napi^Ycia in CodePage=windows-1252
> > > > > Cannot represent String=21,Tor wy[cigowy in CodePage=windows-1252
> > > > > Cannot represent String=21,Droga szybkiego ruchu  (B^Ecznik) in CodePage=windows-1252
> > > > > Cannot represent String=21,Droga krajowa (B^Ecznik) in CodePage=windows-1252
> > > > > Cannot represent String=21,Droga wojew\363dzka (B^Ecznik) in CodePage=windows-1252
> > > > > Cannot represent String=21,Wie[ (>5 tys.) in CodePage=windows-1252
> > > > > Cannot represent String=21,Wie[ (>5 tys.) in CodePage=windows-1252
> > > > > Cannot represent String=21,Restauracja (AmerykaDska) in CodePage=windows-1252
> > > > > Cannot represent String=21,Restauracja (ChiDska) in CodePage=windows-1252
> > > > > Cannot represent String=21,Restauracja (Mi^Ydzynarodowa) in CodePage=windows-1252
> > > > > Cannot represent String=21,Restauracja (WBoska) in CodePage=windows-1252
> > > > > Cannot represent String=21,Restauracja (MeksykaDska) in CodePage=windows-1252
> > > > > Cannot represent String=21,Restauracja (P^Eczki) in CodePage=windows-1252
> > > > > Cannot represent String=21,Restauracja (WegetariaDska) in CodePage=windows-1252
> > > > > Cannot represent String=21,Kr^Ygle in CodePage=windows-1252
> > > > > Cannot represent String=21,Sklep odzie|owy in CodePage=windows-1252
> > > > > Cannot represent String=21,Wypo|yczalnia samochod\363w in CodePage=windows-1252
> > > > > Cannot represent String=21,Gara| in CodePage=windows-1252
> > > > > Cannot represent String=21,Sprzeda| samochod\363w in CodePage=windows-1252
> > > > > Cannot represent String=21,Sklep |eglarski in CodePage=windows-1252
> > > > > Cannot represent String=21,S^Ed in CodePage=windows-1252
> > > > > Cannot represent String=21,O[rodek kultury in CodePage=windows-1252
> > > > > Cannot represent String=21,Wi^Yzienie in CodePage=windows-1252
> > > > > Cannot represent String=21,Stra| po|arna in CodePage=windows-1252
> > > > > Cannot represent String=21,SBupek in CodePage=windows-1252
> > > > > Cannot represent String=21,PrzystaD in CodePage=windows-1252
> > > > > Cannot represent String=21,L^Edowisko helikopterowe in CodePage=windows-1252
> > > > > Cannot represent String=21,Wie|a in CodePage=windows-1252
> > > > > Cannot represent String=21,yr\363dBo in CodePage=windows-1252
> > > > > Cannot represent String=21,Pla|a in CodePage=windows-1252
> > > > > Cannot represent String=21,Przyl^Edek in CodePage=windows-1252
> > > > > Cannot represent String=21,SkaBa in CodePage=windows-1252
> > > > 
> > > > > Which makes sense if codepage 1252 doesn't handle Polish
> > > > > (language code hex 0x15, decimal 21).
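
A quick way to check this kind of thing outside mkgmap is to ask a
CharsetEncoder whether it can encode a given label. The sketch below is
only an illustration: LabelCheck and fits are made-up names, and it assumes
the JVM provides the windows-1252 charset. The Polish letter z-with-dot
(U+017C) has no code point in 1252, whereas e-acute does.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

public class LabelCheck {
    // True if every character of label has a representation in the charset.
    static boolean fits(String label, String charsetName) {
        CharsetEncoder enc = Charset.forName(charsetName).newEncoder();
        return enc.canEncode(label);
    }

    public static void main(String[] args) {
        System.out.println(fits("Gara\u017Ce", "windows-1252")); // Polish label
        System.out.println(fits("Caf\u00E9", "windows-1252"));   // French label
    }
}
```

A check like this could decide whether to write a label or log a warning
without relying on catching CharacterCodingException.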
> > > > 
> > > > > NB the non-ASCII characters in the above are messed up by my
> > > > > cutting and pasting.
> > > > > 
> > > > > Checking the French on my Garmin device, the type descriptions
> > > > > now display accents correctly.
> > > > 
> > > > Ticker
> > > > 
> > > > _______________________________________________
> > > > mkgmap-dev mailing list
> > > > mkgmap-dev@lists.mkgmap.org.uk
> > > > http://www.mkgmap.org.uk/mailman/listinfo/mkgmap-dev

Index: src/uk/me/parabola/imgfmt/app/typ/TYPFile.java
===================================================================
--- src/uk/me/parabola/imgfmt/app/typ/TYPFile.java	(revision 4413)
+++ src/uk/me/parabola/imgfmt/app/typ/TYPFile.java	(working copy)
@@ -121,12 +121,13 @@
 					// If we succeeded then note offsets for indexes
 					strToType.put(off, type);
 					typeToStr.put(type, off);
-
+					writer.put1u(0);
 				} catch (CharacterCodingException ignore) {
+					//ignore.printStackTrace();
 					String name = encoder.charset().name();
-					throw new TypLabelException(name);
+					log.warn("Cannot represent icon String", label, "in CodePage", name);
+					//throw new TypLabelException(name);
 				}
-				writer.put1u(0);
 			}
 		}
 		Utils.closeFile(writer);
Index: src/uk/me/parabola/imgfmt/app/typ/TypData.java
===================================================================
--- src/uk/me/parabola/imgfmt/app/typ/TypData.java	(revision 4413)
+++ src/uk/me/parabola/imgfmt/app/typ/TypData.java	(working copy)
@@ -17,6 +17,7 @@
 import java.util.List;
 
 import uk.me.parabola.imgfmt.app.srt.Sort;
+import uk.me.parabola.log.Logger;
 
 /**
  * Holds all the data for a typ file.
@@ -24,6 +25,8 @@
  * @author Steve Ratcliffe
  */
 public class TypData {
+	private static final Logger log = Logger.getLogger(TypData.class);
+
 	private final ShapeStacking stacking = new ShapeStacking();
 	private final TypParam param = new TypParam();
 	private final List<TypPolygon> polygons = new ArrayList<TypPolygon>();
@@ -51,10 +54,11 @@
 			if (origCodepage != 0) {
 				if (origCodepage != sort.getCodepage()) {
 					// This is just a warning, not a definite problem
-					System.out.println("WARNING: SortCode in TYP txt file different from" +
-							" command line setting");
+					// and is to be expected if have general UTF-8 TYP.txt
+					log.warn("CodePage in TYP txt file:", sort.getCodepage(), "different from --code-page:", origCodepage);
 				}
 			}
+			return;  // want to use the command line one
 		}
 		this.sort = sort;
 		encoder = sort.getCharset().newEncoder();
Index: src/uk/me/parabola/imgfmt/app/typ/TypElement.java
===================================================================
--- src/uk/me/parabola/imgfmt/app/typ/TypElement.java	(revision 4413)
+++ src/uk/me/parabola/imgfmt/app/typ/TypElement.java	(working copy)
@@ -20,6 +20,7 @@
 import java.util.List;
 
 import uk.me.parabola.imgfmt.app.ImgFileWriter;
+import uk.me.parabola.log.Logger;
 
 /**
  * Base routines and data used by points, lines and polygons.
@@ -30,6 +31,8 @@
  * @author Steve Ratcliffe
  */
 public abstract class TypElement implements Comparable<TypElement> {
+	private static final Logger log = Logger.getLogger(TypElement.class);
+
 	private int type;
 	private int subType;
 
@@ -124,17 +127,19 @@
 	protected ByteBuffer makeLabelBlock(CharsetEncoder encoder) {
 		ByteBuffer out = ByteBuffer.allocate(256 * labels.size());
 		for (TypLabel tl : labels) {
-			out.put((byte) tl.getLang());
 			CharBuffer cb = CharBuffer.wrap(tl.getText());
 			try {
 				ByteBuffer buffer = encoder.encode(cb);
+				out.put((byte) tl.getLang());
 				out.put(buffer);
+				out.put((byte) 0);
 			} catch (CharacterCodingException ignore) {
+				//ignore.printStackTrace();
 				String name = encoder.charset().name();
 				//System.out.println("cs " + name);
-				throw new TypLabelException(name);
+				log.warn("Cannot represent String", tl.getText(), "for language", tl.getLang(), "in CodePage", name);
+				//throw new TypLabelException(name);
 			}
-			out.put((byte) 0);
 		}
 
 		return out;
Index: src/uk/me/parabola/mkgmap/main/TypCompiler.java
===================================================================
--- src/uk/me/parabola/mkgmap/main/TypCompiler.java	(revision 4413)
+++ src/uk/me/parabola/mkgmap/main/TypCompiler.java	(working copy)
@@ -21,11 +21,13 @@
 import java.io.InputStreamReader;
 import java.io.Reader;
 import java.io.UnsupportedEncodingException;
+import java.nio.ByteBuffer;
 import java.nio.CharBuffer;
 import java.nio.channels.FileChannel;
 import java.nio.charset.CharacterCodingException;
 import java.nio.charset.Charset;
 import java.nio.charset.CharsetEncoder;
+import java.nio.charset.CharsetDecoder;
 import java.nio.charset.StandardCharsets;
 import java.nio.file.StandardOpenOption;
 
@@ -85,7 +87,7 @@
 			param.setFamilyId(family);
 		if (product != -1)
 			param.setProductId(product);
-		if (cp != -1 && param.getCodePage() == 0)
+		if (cp != -1)
 			param.setCodePage(cp);
 
 		File outFile = new File(filename);
@@ -134,7 +136,7 @@
 		try {
 			Reader r = new BufferedReader(new InputStreamReader(new FileInputStream(filename), charset));
 			try {
-				tr.read(filename, r);
+				tr.read(filename, r, charset);
 			} finally {
 				Utils.closeFile(r);
 			}
@@ -204,79 +206,98 @@
 
 
 	class CharsetProbe {
-		private String codePage;
-		private CharsetEncoder encoder;
+		// TODO: this could be moved to somewhere like util and used on other text files
+		// except looking for Codepage is particular to Typ files
+		// and want to have ability to return default environment decoder
+		// (ie inputStream without 2nd parameter)
 
-		public CharsetProbe() {
-			setCodePage("latin1");
-		}
+		private String probeCharset(String file) {
 
-		private void setCodePage(String codePage) {
-			if ("cp65001".equalsIgnoreCase(codePage)) {
-				this.codePage = "utf-8";
-				this.encoder = StandardCharsets.UTF_8.newEncoder();
-			} else {
-				this.codePage = codePage;
-				this.encoder = Charset.forName(codePage).newEncoder();
-			}
-		}
+			final String BOM_UTF_8    = "\u00EF\u00BB\u00BF";
+			final String BOM_UTF_16LE = "\u00FF\u00FE";
+			final String BOM_UTF_16BE = "\u00FE\u00FF";
+			final String BOM_UTF_32LE = "\u00FF\u00FE\u0000\u0000";
+			final String BOM_UTF_32BE = "\u0000\u0000\u00FE\u00FF";
 
-		private String probeCharset(String file) {
-			String readingCharset = "utf-8";
+			final Charset byteCharNoMap = StandardCharsets.ISO_8859_1; // byteVal == charVal
+			final CharsetDecoder utf8Decoder = StandardCharsets.UTF_8.newDecoder();
 
+			String charset = null;
+			InputStream is = null;
 			try {
-				tryCharset(file, readingCharset);
-				return readingCharset;
-			} catch (TypLabelException e) {
+				is = new FileInputStream(file);
+			} catch (FileNotFoundException e) {
+				throw new ExitException("File not found " + file);
+			}
+			BufferedReader br = new BufferedReader(new InputStreamReader(is, byteCharNoMap));
+			String line;
+			int lineNo = 0;
+			boolean validUTF8 = true;
+			do {
 				try {
-					readingCharset = e.getCharsetName();
-					tryCharset(file, readingCharset);
-				} catch (Exception e1) {
-					return "utf-8";
+					line = br.readLine();
+				} catch (IOException e) {
+					throw new ExitException("Unable to read file " + file);
 				}
-			}
+				if (line == null)
+					break;
+				++lineNo;
+				if (line.isEmpty())
+					continue;
+				if (lineNo <= 2) { // only check the first few lines for these
+					if (line.contains(BOM_UTF_8))
+						charset = "UTF-8";
+					else if (line.contains(BOM_UTF_32LE)) // must test _32 before _16
+						charset = "UTF-32LE";
+					else if (line.contains(BOM_UTF_32BE))
+						charset = "UTF-32BE";
+					else if (line.contains(BOM_UTF_16LE))
+						charset = "UTF-16LE";
+					else if (line.contains(BOM_UTF_16BE))
+						charset = "UTF-16BE";
+					if (charset != null)
+						break;
 
-			return readingCharset;
-		}
+					int strInx = line.indexOf("-*- coding:"); // be lax about start/end
+					if (strInx >= 0) {
+						charset = line.substring(strInx+11).trim();
+						strInx = charset.indexOf(' ');
+						if (strInx >= 0)
+							charset = charset.substring(0, strInx);
+						break;
+					}
+				}
 
-		private void tryCharset(String file, String readingCharset) {
-
-			try (InputStream is = new FileInputStream(file); BufferedReader br = new BufferedReader(new InputStreamReader(is, readingCharset))) {
-
-				String line;
-				while ((line = br.readLine()) != null) {
-					if (line.isEmpty())
-						continue;
-
-					// This is a giveaway the file is in utf-something, so ignore anything else
-					if (line.charAt(0) == 0xfeff)
-						return;
-
-					if (line.startsWith("CodePage=")) {
-						String[] split = line.split("=");
-						try {
-							if (split.length > 1)
-								setCodePage("cp" + Integer.decode(split[1].trim()));
-						} catch (NumberFormatException e) {
-							setCodePage("cp1252");
-						}
+				// special for TypFile; to be compatible with possible old usage
+				if (line.startsWith("CodePage=")) {
+					charset = line.substring(9).trim();
+					try {
+						int codePage = Integer.decode(charset);
+						if (codePage == 65001)
+							charset = "UTF-8";
+						else
+							charset = "cp" + codePage;
+					} catch (NumberFormatException e) {
 					}
+					break;
+				}
 
-					if (line.startsWith("String")) {
-						CharBuffer cb = CharBuffer.wrap(line);
-						if (encoder != null)
-							encoder.encode(cb);
+				if (validUTF8) { // test the line for being valid UTF-8
+					ByteBuffer asBytes = byteCharNoMap.encode(line);
+					try { // arbitrary sequences of bytes > 127 tend not to be UTF8
+						/*CharBuffer asChars =*/ utf8Decoder.decode(asBytes);
+					} catch (CharacterCodingException e) {
+						validUTF8 = false;
+						// don't stop as might still get coding directive
 					}
 				}
-			} catch (UnsupportedEncodingException | CharacterCodingException e) {
-				throw new TypLabelException(codePage);
 
-			} catch (FileNotFoundException e) {
-				throw new ExitException("File not found " + file);
-
+			} while (true);
+			try {
+				is.close();
 			} catch (IOException e) {
-				throw new ExitException("Could not read file " + file);
 			}
+			return charset != null ? charset : (validUTF8 ? "UTF-8" : "cp1252");
 		}
 	}
 }
Index: src/uk/me/parabola/mkgmap/scan/TokenScanner.java
===================================================================
--- src/uk/me/parabola/mkgmap/scan/TokenScanner.java	(revision 4413)
+++ src/uk/me/parabola/mkgmap/scan/TokenScanner.java	(working copy)
@@ -28,6 +28,7 @@
  */
 public class TokenScanner {
 	private static final int NO_PUSHBACK = 0;
+	private String charset = "utf-8";
 
 	// Reading state
 	private final Reader reader;
@@ -53,6 +54,10 @@
 		fileName = filename;
 	}
 
+	public void setCharset(String charset) {
+		this.charset = charset;
+	}
+
 	/**
 	 * Peek and return the first token.  It is not consumed.
 	 */
@@ -236,7 +241,7 @@
 		try {
 			c = reader.read();
 			if (c == 0xfffd)
-				throw new SyntaxException(this, "Bad character in input, file probably not in utf-8");
+				throw new SyntaxException(this, "Bad character in input, file probably not in " + charset);
 		} catch (IOException e) {
 			isEOF = true;
 			c = -1;
Index: src/uk/me/parabola/mkgmap/typ/IdSection.java
===================================================================
--- src/uk/me/parabola/mkgmap/typ/IdSection.java	(revision 4413)
+++ src/uk/me/parabola/mkgmap/typ/IdSection.java	(working copy)
@@ -42,7 +42,8 @@
 		} else if (name.equalsIgnoreCase("ProductCode")) {
 			data.setProductId(ival);
 		} else if (name.equalsIgnoreCase("CodePage")) {
-			data.setSort(SrtTextReader.sortForCodepage(ival));
+			if (data.getSort() == null) // ignore if --code-page
+				data.setSort(SrtTextReader.sortForCodepage(ival));
 		} else {
 			throw new SyntaxException(scanner, "Unrecognised keyword in id section: " + name);
 		}
Index: src/uk/me/parabola/mkgmap/typ/TypTextReader.java
===================================================================
--- src/uk/me/parabola/mkgmap/typ/TypTextReader.java	(revision 4413)
+++ src/uk/me/parabola/mkgmap/typ/TypTextReader.java	(working copy)
@@ -32,9 +32,10 @@
 	// As the file is read in, the information is saved into this data structure.
 	private final TypData data = new TypData();
 
-	public void read(String filename, Reader r) {
+	public void read(String filename, Reader r, String charset) {
 		TokenScanner scanner = new TokenScanner(filename, r);
 		scanner.setCommentChar(null); // the '#' comment character is not appropriate for this file
+		scanner.setCharset(charset);
 
 		ProcessSection currentSection = null;
