Re: Re: Re: Bindy plus Unicode

Alex Dettinger Mon, 27 Jan 2020 01:36:26 -0800

Hi Michael,

  You would need to open a PR against master.
  Please find some helpful information about contributing at
https://camel.apache.org/manual/latest/contributing.html.


  I'm sure ICU4J is functionally great. However, license compatibility is a
legal matter; we don't really have a choice.
  Could you please point me to the ICU4J license you've been using? I could
try checking the compatibility.

Alex

On Sat, Jan 25, 2020 at 5:42 PM <d...@greulich-online.eu> wrote:

>
> Hi Alex,
>
> well, which would then be the appropriate branch? Master or 3.x?
> I guess if I create a ticket, I will be informed by e-mail about what
> happens to it, right?
> I think there could be a ticket + PR in the next two weeks.
>
> A word on ICU4J. Of course I understand that an Apache project has to be
> careful, but there are features, like splitting strings into graphemes,
> that need functionality the old logic in the JDK doesn't support. The lib
> is very common (e.g. LibreOffice uses it) and AFAIK the de facto standard
> for working with elaborate Unicode.
>
> -- Mik
>
> ----
> Sent: Friday, 24 January 2020 at 19:15
> From: "Alex Dettinger" <aldettin...@gmail.com>
> To: users@camel.apache.org
> Subject: Re: Re: Bindy plus Unicode
> Hi Michael,
>
> Good to know that you sorted it out :) The compatibility between the
> ICU4J license and the Apache License is not straightforward; we would
> need to look closer.
> Still, creating a quick ticket and sharing a GitHub project would make it
> possible to save your work, and may be of interest to the community
> later on.
> If a PR is provided against 3.x, chances are that it could be back-ported
> to 2.x. Please keep the time frame in mind, as 2.x may close at the end of
> this year.
>
> Alex
>
> On Fri, Jan 24, 2020 at 5:20 PM Michael Greulich <d...@greulich-online.eu>
> wrote:
>
> >
> > Hi Alex,
> >
> > well, your comment was already very helpful. I created a custom
> > DataFormat and ModelFactory from the default ones for FixedLength. Of
> > course I obeyed the license terms of the Apache license ;-) For some
> > aspects of recognizing chars, I used the ICU4J lib, because the support
> > for some things (e.g. emojis) in the Java runtime is not up to date. The
> > license of ICU is quite permissive, too. I’ve no idea if this is a
> > problem for an Apache project...
> >
> > Well, I think I’m not the only one who has this use case, so I think
> > this can be useful for the community, too. Currently I’m under pressure,
> > but I think I will create a JIRA ticket when the stress has eased. If
> > the community is interested, I can provide the code of my solution and
> > would be glad if this goes upstream (i.e. into the Camel distro) some
> > day.
> >
> > Currently we (the company I work for) are using Camel 2.2 and I guess
> > this will be the case for some time. If this feature or bug fix (I'm not
> > sure which it actually is; I will leave that decision to the community)
> > is accepted, in which version will it be included? Only Camel 3.x, or
> > will it be backported to 2.2?
> >
> > -- Mik
> >
> >
> > ----
> > Sent: Friday, 24 January 2020 at 11:43
> > From: "Alex Dettinger" <aldettin...@gmail.com>
> > To: users@camel.apache.org
> > Subject: Re: Bindy plus Unicode
> > Hi Michael,
> >
> > I was just looking at this component for another purpose and it looks
> > to me that fixed-length tokenization occurs here:
> >
> > https://github.com/apache/camel/blob/master/components/camel-bindy/src/main/java/org/apache/camel/dataformat/bindy/BindyFixedLengthFactory.java#L212..L216
> >
> > So, it counts in Java chars and not code points. You could maybe
> > experiment with injecting a custom BindyFixedLengthFactory via
> > dataFormat.setModelFactory(..).
> >
> > If you feel that an extension point to customize the count/selection of
> > chars/code points/graphemes would be valuable to the community, feel free
> > to raise a JIRA ticket.
> >
> > Alex
> >
> >
> > On Fri, Jan 24, 2020 at 9:52 AM Michael Greulich <
> > mich...@greulich-online.eu>
> > wrote:
> >
> > > Hi,
> > >
> > > I’m having problems with the bindy component and wonder if there is
> > > something I missed. Maybe someone can help me address it. I cannot
> > > believe that I’m the first to hit this problem.
> > >
> > > I need to port an EAI application built using bindy that reads a
> > > fixed-length file(*), converts it and sends the data somewhere else.
> > > Currently this file is in Latin 1 encoding, but we need to move it to
> > > Unicode – effectively UTF-8. We have an ugly, but effectively
> > > unavoidable legacy application that creates the file.
> > >
> > > Unicode is a bit tricky when it comes to counting the length of a
> > > string, especially since Java uses UTF-16 internally, which means 1 – 2
> > > (Java-)chars depending on the codepoint. Bindy seems to use substring
> > > internally for selection and counts chars like Java does. This means
> > > the length of a string is the count of the chars, i.e. UTF-16 code
> > > units, but not codepoints, which is the common denominator (e.g. see
> > > the definition of string length in XMLSchema). And when one takes
> > > combining chars into account (one “base char” plus 0 – n combining
> > > chars are perceived as one “char” by users) it becomes even more of a
> > > problem.
> > >
> > > Is there a possibility to tell bindy how it counts and selects the
> > > tokens based on char counts in a given line? Any suggestions? Is there
> > > a related bug or an upcoming change that addresses this problem?
> > >
> > > -- Mik
> > >
> > > (*) This means that certain data (columns, if you will) start at
> > > certain positions.
> > >
> > >
> >
> >
> >
>
>
>
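
The counting problem discussed in this thread (Java chars vs. code points
vs. user-perceived characters) can be illustrated with a minimal sketch
that uses only the JDK. The class UnicodeLengths and the helper
fieldByCodePoints are hypothetical names made up for this illustration;
they are not part of Camel or Bindy. The JDK's java.text.BreakIterator
stands in here for the grapheme splitting that the thread does with ICU4J,
and its results for newer emoji sequences may depend on the JDK version.

```java
import java.text.BreakIterator;

// Hypothetical illustration class; not part of Camel or Bindy.
final class UnicodeLengths {

    /** Number of UTF-16 code units, i.e. what String.length() reports. */
    static int charCount(String s) {
        return s.length();
    }

    /** Number of Unicode code points; a surrogate pair counts once. */
    static int codePointCount(String s) {
        return s.codePointCount(0, s.length());
    }

    /** Approximate number of user-perceived characters (JDK BreakIterator). */
    static int graphemeCount(String s) {
        BreakIterator it = BreakIterator.getCharacterInstance();
        it.setText(s);
        int count = 0;
        while (it.next() != BreakIterator.DONE) {
            count++;
        }
        return count;
    }

    /** Selects a field whose start and length are given in code points. */
    static String fieldByCodePoints(String line, int startCp, int lengthCp) {
        int begin = line.offsetByCodePoints(0, startCp);
        int end = line.offsetByCodePoints(begin, lengthCp);
        return line.substring(begin, end);
    }

    public static void main(String[] args) {
        String clef = "\uD834\uDD1E"; // U+1D11E: one code point, two chars
        String eAcute = "e\u0301";    // 'e' followed by a combining acute accent

        System.out.println("clef chars=" + charCount(clef)
                + " codePoints=" + codePointCount(clef));
        System.out.println("eAcute chars=" + charCount(eAcute)
                + " codePoints=" + codePointCount(eAcute)
                + " graphemes=" + graphemeCount(eAcute));

        // A field of two "characters" starting at position 1:
        String line = "a" + clef + "b cd";
        System.out.println("by chars:       " + line.substring(1, 3));
        System.out.println("by code points: " + fieldByCodePoints(line, 1, 2));
    }
}
```

With the sample line above, the char-based substring returns only the clef
symbol, while the code-point-based selection returns the clef followed by
"b". A custom model factory, as suggested earlier in the thread, could
apply the same idea to its own field selection.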
