RE: Re: Bindy plus Unicode

Michael Greulich Fri, 24 Jan 2020 08:20:18 -0800


Hi Alex,

well, your comment was already very helpful. I created a custom DataFormat and 
ModelFactory based on the default ones for FixedLength. Of course I complied with 
the terms of the Apache license ;-) For some aspects of recognizing chars, I 
used the ICU4J library, because the Java runtime’s support for some things (e.g. 
emojis) is not up to date. The ICU license is quite permissive, too, but I have 
no idea whether it is a problem for an Apache project...
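
To illustrate what "recognizing chars" means here, a minimal sketch (not the 
actual code of my solution). It uses the JDK's java.text.BreakIterator as a 
stand-in; ICU4J offers a class of the same name in com.ibm.icu.text with a 
similar API and more current Unicode data. The class and method names below are 
made up for illustration.

```java
import java.text.BreakIterator;

// Hypothetical helper, for illustration only.
class GraphemeCount {

    // Counts user-perceived characters: a base char plus its combining
    // chars is counted once. How emoji sequences are split depends on the
    // Unicode data of the runtime, which is why ICU4J may be preferable.
    static int countGraphemes(String text) {
        BreakIterator it = BreakIterator.getCharacterInstance();
        it.setText(text);
        int count = 0;
        while (it.next() != BreakIterator.DONE) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String s = "e\u0301x"; // 'e' + COMBINING ACUTE ACCENT, then 'x'
        System.out.println(s.length());        // Java chars
        System.out.println(countGraphemes(s)); // perceived chars
    }
}
```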

Well, I think I’m not the only one who has this use case, so I think this 
can be useful for the community, too. I’m currently under pressure, but I will 
create a JIRA ticket once the stress has eased. If the community is interested, 
I can provide the code of my solution and would be glad if it goes upstream 
(i.e. into the Camel distro) some day. 

Currently we (the company I work for) are using Camel 2.2 and I guess this will 
be the case for some time. If this feature or bug fix (I’m not sure which it 
actually is, so I will leave that decision to the community) is accepted, in 
which version will it be included? Only Camel 3.x, or will it be backported to 2.2?

-- Mik

--------------------------------------------------------------------------
Sent: Friday, 24 January 2020 at 11:43
From: "Alex Dettinger" <aldettin...@gmail.com>
To: users@camel.apache.org
Subject: Re: Bindy plus Unicode
Hi Michael,

I was just looking at this component for another purpose and it looks
to me that fixed length tokenization occurs here:

https://github.com/apache/camel/blob/master/components/camel-bindy/src/main/java/org/apache/camel/dataformat/bindy/BindyFixedLengthFactory.java#L212..L216
So it counts in Java chars and not code points. You could experiment with
injecting a custom BindyFixedLengthFactory via
dataFormat.setModelFactory(..).
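
To make the difference concrete, here is a rough sketch using only standard
java.lang.String methods. It is not Bindy code; the class and the helper
sliceByCodePoints are hypothetical and only show what a custom factory might
do instead of a plain substring.

```java
// Hypothetical illustration, not part of camel-bindy.
class CodePointSlice {

    // Selects lengthCp code points starting at code point index startCp.
    // offsetByCodePoints throws IndexOutOfBoundsException if the line
    // holds fewer code points than requested.
    static String sliceByCodePoints(String line, int startCp, int lengthCp) {
        int begin = line.offsetByCodePoints(0, startCp);
        int end = line.offsetByCodePoints(begin, lengthCp);
        return line.substring(begin, end);
    }

    public static void main(String[] args) {
        // 'A', U+1F600 (stored as a surrogate pair), 'B', 'C'
        String line = "A\uD83D\uDE00BC";
        System.out.println(line.length());                         // 5 Java chars
        System.out.println(line.codePointCount(0, line.length())); // 4 code points
        // A char-based substring(1, 2) would cut the surrogate pair in half;
        // the code-point-based slice returns the whole emoji.
        System.out.println(sliceByCodePoints(line, 1, 1));
    }
}
```

This only covers code points. Grapheme-level selection (base char plus
combining chars) would need a break iterator on top.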

If you feel that an extension point to customize the counting/selection of
chars/code points/graphemes would be valuable to the community, feel free to
raise a JIRA ticket.

Alex

On Fri, Jan 24, 2020 at 9:52 AM Michael Greulich <mich...@greulich-online.eu>
wrote:

> Hi,
>
> I’m having problems with the bindy component and wonder if there is
> something I missed. Maybe someone can help me address it. I cannot believe
> that I’m the first to hit this problem.
>
> I need to port an EAI application built using bindy that reads a
> fixed-format file(*), converts it, and sends the data somewhere else.
> Currently this file is in Latin 1 encoding, but we need to move it to
> Unicode, effectively UTF-8. We have an ugly but effectively unavoidable
> legacy application that creates the file.
>
> Unicode is a bit tricky when it comes to counting the length of a string,
> especially since Java uses UTF-16 internally, which means 1 – 2 (Java) chars
> per code point. Internally, bindy seems to use substring for selection
> and counts chars like Java does. This means the length of a
> string is the count of chars, i.e. UTF-16 code units, not
> code points, which are the common denominator (e.g. see the definition of
> string length in XML Schema). And when one takes combining chars into
> account (one “base char” plus 0 – n combining chars are perceived as one
> “char” by users), it becomes even more of a problem.
>
> Is there a way to tell bindy how to count chars and select the
> tokens based on char counts in a given line? Any suggestions? Is there a
> related bug, or an upcoming change, that addresses this problem?
>
> -- Mik
>
> (*) This means that certain data (columns, if you will) start at certain
> positions.
>
>
