Hi, 

I'm having problems with the bindy component and wonder if there is something I 
missed. Maybe someone can help me address it. I can't believe I'm the first to 
hit this problem.

I need to port an EAI application built using bindy that reads a fixed-length 
file(*), converts it and sends the data somewhere else. Currently this file is 
in Latin-1 encoding, but we need to take it to Unicode, effectively UTF-8. We 
have an ugly, but effectively unavoidable, legacy application that creates the 
file. 
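
For context, the mapping looks roughly like this (class name, field names, 
positions and lengths are invented for illustration):

    import org.apache.camel.dataformat.bindy.annotation.DataField;
    import org.apache.camel.dataformat.bindy.annotation.FixedLengthRecord;

    @FixedLengthRecord(length = 30)
    public class LegacyRecord {

        // first column: positions 1-10 of the line
        @DataField(pos = 1, length = 10)
        private String name;

        // second column: positions 11-30
        @DataField(pos = 11, length = 20)
        private String description;
    }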

Unicode is a bit tricky when it comes to counting the length of a string, 
especially since Java internally uses UTF-16, which means 1-2 (Java) chars per 
codepoint. Bindy seems to use substring internally for selecting tokens and 
counts chars like Java does. This means the length of a string is the count of 
UTF-16 code units (surrogates included), not codepoints, which is the common 
denominator (e.g. see the definition of string length in XML Schema). And when 
one takes combining characters into account (one "base char" plus 0-n combining 
chars are perceived as one character by users), it becomes even more of a 
problem. 
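
To make the three notions of "length" concrete, here is a small sketch using 
only the standard library (the counts in the comments are what the JDK 
reports):

    import java.text.BreakIterator;

    public class LengthDemo {
        public static void main(String[] args) {
            // "e" + combining acute accent + U+1D11E (outside the BMP,
            // so it needs a surrogate pair in UTF-16)
            String s = "e\u0301\uD834\uDD1E";

            // UTF-16 code units, what length()/substring() count: 4
            System.out.println(s.length());

            // Unicode codepoints: 3
            System.out.println(s.codePointCount(0, s.length()));

            // user-perceived characters (grapheme clusters): 2
            BreakIterator it = BreakIterator.getCharacterInstance();
            it.setText(s);
            int graphemes = 0;
            while (it.next() != BreakIterator.DONE) {
                graphemes++;
            }
            System.out.println(graphemes);
        }
    }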

Is there a possibility to tell bindy how it counts and selects the tokens 
based on char counts in a given line? Any suggestions? Is there a related bug 
report or upcoming change that addresses this problem?
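
What I mean by "selecting tokens based on codepoint counts", expressed as a 
hypothetical helper outside of bindy, is roughly this:

    // extract the field that starts at codepoint offset `start` and is
    // `count` codepoints long; not part of bindy, just an illustration
    static String codePointSubstring(String line, int start, int count) {
        int begin = line.offsetByCodePoints(0, start);
        int end = line.offsetByCodePoints(begin, count);
        return line.substring(begin, end);
    }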

-- Mik 

(*) This means that certain data starts at certain positions in the line 
(columns, if you will).
