Some encoding bugs in shape file writer
---------------------------------------

                 Key: GEOT-2999
                 URL: http://jira.codehaus.org/browse/GEOT-2999
             Project: GeoTools
          Issue Type: Bug
          Components: data shapefile
    Affects Versions: 2.6.2
         Environment: Windows XP, Java 6 Update 12, default encoding is windows-1257
            Reporter: Jaan Vajakas
            Assignee: Andrea Aime
         Attachments: UTF8SHPWritingTest.java

There are at least three bugs in 
DbaseFileWriter.FieldFormatter.getFieldString(int size, String s):
* if the encoding of the output shapefile is UTF-8 and the value written to it 
has more bytes than the field length but fewer characters than the field 
length, then a StringIndexOutOfBoundsException may occur;
* if the encoding of the output shapefile is UTF-8 and the value written to it 
has more bytes than the field length, then an empty string may be written, for 
two different reasons.
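
To make the triggering condition concrete, here is a minimal sketch that only 
compares character count and UTF-8 byte count for a sample value. It does not 
reproduce the GeoTools code; the class name, the helper names and the field 
size of 5 are assumptions chosen for illustration.

```java
import java.nio.charset.StandardCharsets;

// Illustrative only: not part of GeoTools.
public class ByteVsCharLength {

    // Number of bytes the value occupies once encoded as UTF-8.
    public static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // True when the value has fewer characters than the field size
    // but more UTF-8 bytes than the field size (the case in the first bullet).
    public static boolean fitsInCharsButNotInBytes(String s, int fieldSize) {
        return s.length() < fieldSize && utf8Length(s) > fieldSize;
    }

    public static void main(String[] args) {
        // Four non-ASCII letters, written as escapes so the source encoding
        // of this file does not matter. Each takes two bytes in UTF-8.
        String value = "\u00f5\u00e4\u00f6\u00fc";
        int fieldSize = 5; // hypothetical DBF field length

        System.out.println("chars: " + value.length());
        System.out.println("bytes: " + utf8Length(value));
        System.out.println("problem case: " + fitsInCharsButNotInBytes(value, fieldSize));
    }
}
```

Any truncation logic that mixes the two lengths (cutting a string at a byte 
offset, for example) has to handle this case explicitly.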

See the attached testcase (the source is in UTF-8 encoding).

It seems to me that it would be better to use byte arrays or byte buffers 
instead of strings as the return values of the getFieldString(...) methods 
because
* this way we could use java.nio.charset.CharsetEncoder.encode(CharBuffer, 
ByteBuffer, boolean) to encode as many characters as possible, so that 
getFieldString(int size, String s) would only have to pad the remaining spaces;
* performance might improve, as encoding would take place only once per 
value instead of the two or three times it takes now, and fewer temporary 
objects would be created;
* the return values of the getFieldString(...) methods seem to be bytes by 
nature rather than characters, as suggested e.g. by the following comment in 
DbaseFileWriter.java: "Adding the charset to getBytes causes the output to get 
altered for the '@: Timestamp' field. And using getBytes returns a different 
array in 64-bit platforms so we expect chars and cast to byte just before 
writing.".
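
The byte-buffer approach proposed above could look roughly like the following 
sketch. It is not the existing GeoTools implementation: the class name 
FieldEncodingSketch and the method encodeField are hypothetical, and padding 
with the single byte 0x20 assumes an ASCII-compatible charset such as UTF-8.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Illustrative only: not part of GeoTools.
public class FieldEncodingSketch {

    // Encodes as many whole characters of value as fit into size bytes,
    // then pads the rest of the field with spaces.
    public static byte[] encodeField(String value, int size, Charset charset) {
        CharsetEncoder encoder = charset.newEncoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);

        ByteBuffer out = ByteBuffer.allocate(size);

        // On OVERFLOW the encoder stops before the first character that
        // does not fit, so no partial multi-byte sequence is written.
        encoder.encode(CharBuffer.wrap(value), out, true);
        encoder.flush(out);

        // Assumption: a space is the single byte 0x20 in the target charset.
        while (out.hasRemaining()) {
            out.put((byte) ' ');
        }
        return out.array();
    }

    public static void main(String[] args) {
        byte[] field = encodeField("\u00f5\u00e4\u00f6\u00fc", 5, StandardCharsets.UTF_8);
        System.out.println(field.length);
        System.out.println("[" + new String(field, StandardCharsets.UTF_8) + "]");
    }
}
```

With a field size of 5, only the first two of the four two-byte characters 
fit, and the fifth byte is padding. The value is encoded once, and the caller 
receives exactly size bytes to write.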

And what about UTF-16? Can a DBF file be encoded in an encoding that is not 
ASCII-compatible? It seems that GeoTools currently lets you set UTF-16 as the 
DBF output encoding, but it then writes only zeros instead of data.
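
As background for the UTF-16 question, the sketch below prints the bytes that 
Java produces for a plain ASCII string in UTF-16. It only illustrates why 
UTF-16 is not ASCII-compatible; it is not a diagnosis of what the writer does 
internally. The class and method names are made up for the example.

```java
import java.nio.charset.StandardCharsets;

// Illustrative only: not part of GeoTools.
public class Utf16BytesDemo {

    // Formats a byte array as space-separated two-digit hex values.
    public static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < bytes.length; i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", bytes[i] & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Big-endian UTF-16 without a byte order mark:
        // every ASCII character is preceded by a zero byte.
        System.out.println(hex("abc".getBytes(StandardCharsets.UTF_16BE)));

        // Java's "UTF-16" charset additionally writes a byte order mark first.
        System.out.println(hex("abc".getBytes(StandardCharsets.UTF_16)));

        // For comparison, UTF-8 leaves ASCII characters as single bytes.
        System.out.println(hex("abc".getBytes(StandardCharsets.UTF_8)));
    }
}
```

Code that assumes one byte per ASCII character, or that casts chars to bytes, 
cannot produce valid output for such an encoding.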



        
