Markus Neteler wrote:

> >> I am writing v.in.geonames to easily read in data from
> >> http://download.geonames.org/export/dump/
> >>
> >> The script is essentially using v.in.ascii to read in the CSV file encoded
> >> in UTF-8 Unicode text. There are placenames in various languages including
> >> Japanese.
> >> v.in.ascii isn't able to read them properly and fails on such lines,
> >> example:
> >> How to fix this problem?
> >
> > Can you please provide accurate and sufficient information about the
> > problem?
>
> As always, I try.

As a general rule, if you're having problems with input which contains
non-ASCII text, use an attachment.

The files appear to be UTF-8, but your previous email used ISO-2022.
They may seem "equivalent" to your mail program, but they may not be
equivalent so far as e.g. v.in.ascii is concerned.

In this case, I don't think that encodings or non-ASCII characters are
actually the problem. However, if the data had actually been encoded
in ISO-2022, it could have been related.

> > I.e. the exact data being fed to v.in.ascii (*before* it has been
> > mangled by the various components of the email chain), the v.in.ascii
> > command which is failing, etc.
>
> Attached the original file reduced to 1 offending line:
> wget http://download.geonames.org/export/dump/IT.zip
> cd /tmp
> unzip IT.zip
> grep 'Italian Republic' /tmp/IT.txt > /tmp/IT_example.csv
>
> Import into LatLong location, replicating the script functionality
>
> v.in.ascii cat=0 x=6 y=5 fs=tab in=/tmp/IT_example.csv out=test
> columns='geonameid integer, name varchar(200), asciiname varchar(200),
> alternatename varchar(4000), latitude double precision, longitude
> double precision, featureclass varchar(1), featurecode varchar(10),
> countrycode varchar(2), cc2 varchar(60), admin1code varchar(20),
> admin2code varchar(20), admin3code varchar(20), admin4code
> varchar(20), population integer, elevation varchar(5), gtopo30
> integer, timezone varchar(50), modification date' --o
> Scanning input for column types...
> ERROR: Unparsable latitude value in column <4>: PCLI

I don't get this particular error, but I do have some other problems.
First, I had to increase the buffer size:

--- vector/v.in.ascii/points.c	(revision 31901)
+++ vector/v.in.ascii/points.c	(working copy)
@@ -74,7 +74,7 @@
     char *coorbuf, *tmp_token, *sav_buf;
     int skip = FALSE, skipped = 0;
 
-    buflen = 1000;
+    buflen = 4000;
     buf = (char *)G_malloc(buflen);
     buf_raw = (char *)G_malloc(buflen);
     coorbuf = (char *)G_malloc(256);

Otherwise, the input was truncated in the middle of the list of
translated names. This caused points_analyse[1] to see too few
columns, resulting in:

	Scanning input for column types...
	Maximum input row length: 999
	Maximum number of columns: 11
	Minimum number of columns: 4
	ERROR: x column number > minimum last column number
	(incorrect field separator?)

Fixing that, it now complains about:

	Scanning input for column types...
	Maximum input row length: 1309
	Maximum number of columns: 14
	Minimum number of columns: 14
	WARNING: Table <test> linked to vector map <test> does not exist
	ERROR: Number of columns defined (19) does not match number of columns
	       (14)

This is caused by G_tokenize() skipping leading whitespace, including
tabs, even when the separator is a tab. Consequently, a run of
consecutive blank fields is interpreted as a single blank field.

After fixing that, I get:

	Scanning input for column types...
	Maximum input row length: 1309
	Maximum number of columns: 19
	Minimum number of columns: 19
	WARNING: Column number 11 <admin1code> defined as string has only integer
	         values
	Importing points...
	Segmentation fault (core dumped)

This is caused by overflowing another 1000-byte buffer in
points_to_bin():

--- vector/v.in.ascii/points.c	(revision 31911)
+++ vector/v.in.ascii/points.c	(working copy)
@@ -269,7 +269,7 @@
 		  int *coltype, int xcol, int ycol, int zcol, int catcol,
 		  int skip_lines)
 {
-    char *buf, buf2[1000];
+    char *buf, buf2[4000];
     int cat = 0;
     int row = 1;
     struct line_pnts *Points;

After which, the file appears to import without any problems.
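
To illustrate the tokenizing problem described above, here is a minimal
sketch of a splitter that treats every separator as a field boundary,
so that a run of consecutive tabs yields a run of empty fields rather
than being swallowed as leading whitespace. The name split_fields() and
its signature are made up for this example; this is not the
G_tokenize() code, nor the fix that was committed.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper (not part of the GRASS API).
 *
 * Splits `line` in place on every occurrence of `sep`, storing a
 * pointer to the start of each field in `fields`. No whitespace is
 * skipped, so two adjacent separators produce an empty field.
 *
 * Returns the number of fields stored. At most `max_fields` are
 * stored; anything after the last stored field is dropped.
 */
int split_fields(char *line, char sep, char **fields, int max_fields)
{
    int n = 0;
    char *p = line;

    assert(line != NULL && fields != NULL);
    assert(sep != '\0');	/* the terminator cannot be a separator */

    while (n < max_fields) {
	char *end = strchr(p, sep);

	fields[n++] = p;	/* field starts here, possibly empty */
	if (end == NULL)
	    break;		/* last field runs to the end of the line */
	*end = '\0';		/* terminate this field */
	p = end + 1;		/* next field starts right after the separator */
    }

    return n;
}
```

With this approach, a line such as "a<TAB><TAB><TAB>b<TAB>" splits into
five fields (the second, third and fifth being empty), which is what the
column count check needs to see for records with blank admin codes.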

I have committed a fix to G_tokenize(), and also enlarged the buffers
in v.in.ascii to 4000 bytes (although removing fixed limits altogether
would be better).

[1] BTW, don't we normally use US-English spellings, i.e. "analyze"
instead of "analyse"?

-- 
Glynn Clements <[EMAIL PROTECTED]>