Hi Michael,

I was just looking at this component for another purpose, and it looks to me like fixed-length tokenization occurs here:
https://github.com/apache/camel/blob/master/components/camel-bindy/src/main/java/org/apache/camel/dataformat/bindy/BindyFixedLengthFactory.java#L212..L216

So it counts in Java chars and not code points. You could experiment with injecting a custom BindyFixedLengthFactory via dataFormat.setModelFactory(..).

If you feel that an extension point to customize the count/selection of chars/code points/graphemes would be valuable to the community, feel free to raise a JIRA ticket.

Alex

On Fri, Jan 24, 2020 at 9:52 AM Michael Greulich <mich...@greulich-online.eu> wrote:

> Hi,
>
> I’m having problems with the bindy component and wonder if there is
> something I missed. Maybe someone can help me address it. I cannot
> believe that I’m the first to hit this problem.
>
> I need to port an EAI application built using bindy that reads a fixed
> type file(*), converts it and sends the data somewhere else. Currently
> this file is in Latin 1 encoding, but we need to take it to Unicode –
> effectively UTF-8. We have an ugly, but effectively unavoidable legacy
> application that creates the file.
>
> Unicode is a bit tricky when it comes to counting the length of a
> string, especially since Java uses UTF-16 internally, which means 1 – 2
> (Java) chars depending on the code point. Bindy seems to use substring
> internally for selection and counts chars like Java does. This means the
> length of a string is the count of chars, i.e. UTF-16 code units (a
> surrogate pair counts as two), and not code points, which are the common
> denominator (e.g. see the definition of string length in XML Schema).
> And when one takes combining chars into account (one “base char” plus
> 0 – n combining chars are perceived as one “char” by users), it becomes
> even more of a problem.
>
> Is there a possibility to tell bindy how it counts and selects the
> tokens based on char counts in a given line? Any suggestions? Is there a
> related bug or change to come that addresses this problem?
>
> -- Mik
>
> (*) This means that certain data (columns, if you will) start at certain
> positions.
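To make the difference between counting chars, code points and user-perceived characters concrete, here is a minimal sketch using only the plain JDK. It is not Bindy code and does not show how a custom factory would be wired in; the class name FixedLengthCounting and its method names are made up for illustration.

```java
import java.text.BreakIterator;

// Hypothetical helper, not part of Camel or Bindy.
public class FixedLengthCounting {

    /** Number of Unicode code points in s (a surrogate pair counts once). */
    public static int codePointLength(String s) {
        return s.codePointCount(0, s.length());
    }

    /**
     * Substring addressed in code points instead of UTF-16 chars.
     * Throws IndexOutOfBoundsException if the line is too short.
     */
    public static String sliceByCodePoints(String s, int startCp, int lengthCp) {
        int begin = s.offsetByCodePoints(0, startCp);
        int end = s.offsetByCodePoints(begin, lengthCp);
        return s.substring(begin, end);
    }

    /** Count of character boundaries as seen by the JDK's BreakIterator,
     *  an approximation of user-perceived characters. */
    public static int graphemeLength(String s) {
        BreakIterator it = BreakIterator.getCharacterInstance();
        it.setText(s);
        int count = 0;
        while (it.next() != BreakIterator.DONE) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // 'A', U+1F600 (one code point, two chars), 'e' + combining acute, 'Z'
        String line = "A\uD83D\uDE00e\u0301Z";

        System.out.println(line.length());          // 6 UTF-16 chars
        System.out.println(codePointLength(line));  // 5 code points
        System.out.println(graphemeLength(line));   // depends on BreakIterator rules

        // substring(1, 2) would cut the surrogate pair in half;
        // the code point based slice returns the whole U+1F600.
        System.out.println(sliceByCodePoints(line, 1, 1).length()); // 2
    }
}
```

Whether a field width should mean chars, code points or graphemes depends on how the producing application pads its columns, so the sketch only shows the counting options and not which one is right for a given file.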