Hi Michael,

    I was just looking at this component for another purpose and it looks
to me that fixed length tokenzation occurs here:

https://github.com/apache/camel/blob/master/components/camel-bindy/src/main/java/org/apache/camel/dataformat/bindy/BindyFixedLengthFactory.java#L212..L216
  So, It counts in java chars and not code points. You can maybe experiment
injecting a custom BindyFixedLengthFactory, via
dataFormat.setModelFactory(..).

  Would you feel that an extension point to customize count/selection of
chars/codepoint/grapheme would be valuable to the community, feel free to
raise a JIRA ticket.

Alex


On Fri, Jan 24, 2020 at 9:52 AM Michael Greulich <mich...@greulich-online.eu>
wrote:

> Hi,
>
> I’m having problems with the bindy component and wonder if there is
> something I missed. Maybe one can help me addressing it. I cannot believe,
> that I’m the first to hit this problem.
>
> I need to port an EAI application built using bindy, that reads a fixed
> type file(*) converts it and sends the data somewhere else. Currently this
> file is in Latin 1 encoding, but we need to take it to Unicode –
> effectively UTF-8. We have an ugly, but effectively unavoidable legacy
> application that creates the file.
>
> Unicode is a bit tricky, when it comes to counting the length of a string
> specially since Java uses internally UTF-16, which means depending on the
> codepoint 1 – 2 (Java-)chars. Bindy seems to use internally for selection
> substring and counts chars like Java does. This means the length of a
> string is the count of the chars, i.e. UTF-16 surrogates, but not
> codepoints, which is the common denominator (e.g. see definition of string
> length in XMLSchema). And when one takes combing chars into account (one
> “base char” plus 0 – n combining chars are perceived as one “char” by
> users) it becomes even more of a problem.
>
> Is there a possibility to tell bindy how it counts an and selects the
> tokens based on char counts in a given line? Any suggestions? Is the are
> related bug or change to come that addresses this problem?
>
> -- Mik
>
> (*) This means, that on certain positions there start certain data
> (columns if you will).
>
>

Reply via email to