Hi Michael,

Good to know that you sorted it out :)

The compatibility between the ICU4J license and the Apache License is not straightforward; we would need to look closer. Still, creating a quick ticket and sharing a GitHub project would save your work, and it may be of interest to the community later on. If someone provides a PR against 3.x, chances are that it could be back-ported to 2.x. Please keep the time frame in mind, as 2.x may close at the end of this year.
Alex

On Fri, Jan 24, 2020 at 5:20 PM Michael Greulich <d...@greulich-online.eu> wrote:

>
> Hi Alex,
>
> well, your comment was already very helpful. I created a custom DataFormat
> and ModelFactory from the default ones for FixedLength. Of course I obeyed
> the license terms of the Apache license ;-) For some aspects of recognizing
> chars, I used the ICU4J lib, because the Java runtime’s support for some
> things (e.g. emojis) is not up to date. The license of ICU is quite
> permissive, too. I’ve no idea if this is a problem for an Apache project...
>
> Well, I think I’m not the only one who has this use case, so I think
> this can be useful for the community, too. Currently I’m under pressure,
> but I think I will create a JIRA ticket once the stress has eased. If
> the community is interested, I can provide the code of my solution and
> would be glad if it goes upstream (i.e. into the Camel distro) some
> day.
>
> Currently we (the company I work for) are using Camel 2.2, and I guess this
> will be the case for some time. If this feature or bug fix (I’m not sure
> which it actually is; I will leave that decision to the community) is
> accepted, in which version will it be included? Only Camel 3.x, or will it
> be backported to 2.2?
>
> -- Mik
>
> --------------------------------------------------------------------------
> Sent: Friday, 24 January 2020 at 11:43
> From: "Alex Dettinger" <aldettin...@gmail.com>
> To: users@camel.apache.org
> Subject: Re: Bindy plus Unicode
> Hi Michael,
>
> I was just looking at this component for another purpose, and it looks
> to me that fixed-length tokenization occurs here:
>
> https://github.com/apache/camel/blob/master/components/camel-bindy/src/main/java/org/apache/camel/dataformat/bindy/BindyFixedLengthFactory.java#L212..L216
>
> So it counts in Java chars and not code points. You can maybe experiment
> with injecting a custom BindyFixedLengthFactory via
> dataFormat.setModelFactory(..).
>
> If you feel that an extension point to customize the count/selection of
> chars/code points/graphemes would be valuable to the community, feel free to
> raise a JIRA ticket.
>
> Alex
>
>
> On Fri, Jan 24, 2020 at 9:52 AM Michael Greulich <mich...@greulich-online.eu>
> wrote:
>
> > Hi,
> >
> > I’m having problems with the bindy component and wonder if there is
> > something I missed. Maybe someone can help me address it. I cannot believe
> > that I’m the first to hit this problem.
> >
> > I need to port an EAI application built using bindy that reads a fixed
> > type file(*), converts it, and sends the data somewhere else. Currently this
> > file is in Latin 1 encoding, but we need to move it to Unicode,
> > effectively UTF-8. We have an ugly but effectively unavoidable legacy
> > application that creates the file.
> >
> > Unicode is a bit tricky when it comes to counting the length of a string,
> > especially since Java uses UTF-16 internally, which means 1 or 2 (Java)
> > chars per code point, depending on the code point. Bindy seems to use
> > substring internally for selection and counts chars like Java does. This
> > means the length of a string is the count of chars, i.e. UTF-16 code
> > units, not code points, which are the common denominator (e.g. see the
> > definition of string length in XML Schema). And when one takes combining
> > chars into account (one “base char” plus 0 to n combining chars are
> > perceived as one “char” by users), it becomes even more of a problem.
> >
> > Is there a possibility to tell bindy how it counts and selects the
> > tokens based on char counts in a given line? Any suggestions? Is there a
> > related bug or an upcoming change that addresses this problem?
> >
> > -- Mik
> >
> > (*) This means that certain data (columns, if you will) start at certain
> > positions.
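
For reference, the difference between counting Java chars, code points, and graphemes discussed in this thread can be sketched with plain JDK calls. This is only a minimal illustration, not Bindy's code and not the custom factory mentioned above: the class FieldLengths and its method names are made up for this example, and the grapheme count relies on the JDK's BreakIterator, whose emoji handling depends on the Java version (which is the reason ICU4J came up).

```java
import java.text.BreakIterator;

// Hypothetical helper, not part of camel-bindy: it only illustrates the
// three ways of measuring a field that are discussed in this thread.
final class FieldLengths {

    // Number of UTF-16 code units; this is what String.length() and
    // String.substring() work with.
    static int charLength(String s) {
        return s.length();
    }

    // Number of Unicode code points; a surrogate pair counts once.
    static int codePointLength(String s) {
        return s.codePointCount(0, s.length());
    }

    // Approximate number of user-perceived characters, based on the
    // JDK's character BreakIterator (results for newer emoji sequences
    // vary with the Java version).
    static int graphemeLength(String s) {
        BreakIterator it = BreakIterator.getCharacterInstance();
        it.setText(s);
        int count = 0;
        while (it.next() != BreakIterator.DONE) {
            count++;
        }
        return count;
    }

    // Extracts a field whose start offset and length are given in code
    // points instead of Java chars. Throws IndexOutOfBoundsException if
    // the line holds fewer code points than requested.
    static String fieldByCodePoints(String line, int startCp, int lengthCp) {
        int begin = line.offsetByCodePoints(0, startCp);
        int end = line.offsetByCodePoints(begin, lengthCp);
        return line.substring(begin, end);
    }
}
```

With the line "A\uD83D\uDE00BC" (an A, one emoji stored as a surrogate pair, then B and C), charLength reports 5 while codePointLength reports 4, and fieldByCodePoints(line, 1, 2) returns the emoji plus the B, whereas line.substring(1, 3) would return only the emoji. A custom factory could apply the same offset translation before cutting fields, assuming the record layout is specified in code points.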