Re: DataOutputSerializer serializing long UTF Strings

2024-01-23 Thread Gyula Fóra
Hi Peter!

I think this is a good additional serialization utility for Flink that may
benefit different data formats / connectors in the future.

+1

Cheers,
Gyula

On Mon, Jan 22, 2024 at 8:04 PM Steven Wu wrote:

> I think this is a reasonable extension to `DataOutputSerializer`. Although
> 64 KB is not small, it is still possible to have long strings over that
> limit. There are already precedents of extending the `DataOutputSerializer`
> API, e.g.:
>
> public void setPosition(int position) {
>     Preconditions.checkArgument(
>             position >= 0 && position <= this.position, "Position out of bounds.");
>     this.position = position;
> }
>
> public void setPositionUnsafe(int position) {
>     this.position = position;
> }
>
>
> On Fri, Jan 19, 2024 at 2:51 AM Péter Váry wrote:
>
> > Hi Team,
> >
> > During the root cause analysis of an Iceberg serialization issue [1], we
> > found that *DataOutputSerializer.writeUTF* has a hard limit on the length
> > of the string (64 KB). This is inherited from the *DataOutput.writeUTF*
> > method, where the JDK explicitly defines this limit [2].
> >
> > For our use case we need to be able to serialize longer UTF strings, so
> > we will need to define a *writeLongUTF* method with a specification
> > similar to *writeUTF*, but without the length limit.
> >
> > My questions are:
> > - Is this something that would be useful for every Flink user? Shall we
> > add this method to *DataOutputSerializer*?
> > - Or is it very specific to Iceberg, so that we should keep it in the
> > Iceberg connector code?
> >
> > Thanks,
> > Peter
> >
> > [1] - https://github.com/apache/iceberg/issues/9410
> > [2] - https://docs.oracle.com/javase/8/docs/api/java/io/DataOutput.html#writeUTF-java.lang.String-
> >
>
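
The 64 KB cap discussed above comes from the writeUTF contract: the encoded
length is stored in a 2-byte unsigned prefix, so any string whose modified
UTF-8 encoding exceeds 65535 bytes is rejected. The behaviour can be
reproduced with plain java.io.DataOutputStream, independent of Flink; the
class and variable names in this sketch are illustrative only:

import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.UTFDataFormatException;

public class WriteUtfLimitDemo {
    public static void main(String[] args) throws Exception {
        // A string whose UTF-8 encoding is well over the 65535-byte limit.
        String longString = "x".repeat(70_000);
        try (DataOutputStream out =
                new DataOutputStream(new ByteArrayOutputStream())) {
            // The 2-byte length prefix cannot represent 70000 bytes, so this throws.
            out.writeUTF(longString);
        } catch (UTFDataFormatException e) {
            System.out.println("writeUTF rejected the string: " + e.getMessage());
        }
    }
}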


Re: DataOutputSerializer serializing long UTF Strings

2024-01-22 Thread Steven Wu
I think this is a reasonable extension to `DataOutputSerializer`. Although
64 KB is not small, it is still possible to have long strings over that
limit. There are already precedents of extending the `DataOutputSerializer`
API, e.g.:

public void setPosition(int position) {
    Preconditions.checkArgument(
            position >= 0 && position <= this.position, "Position out of bounds.");
    this.position = position;
}

public void setPositionUnsafe(int position) {
    this.position = position;
}


On Fri, Jan 19, 2024 at 2:51 AM Péter Váry wrote:

> Hi Team,
>
> During the root cause analysis of an Iceberg serialization issue [1], we
> found that *DataOutputSerializer.writeUTF* has a hard limit on the length
> of the string (64 KB). This is inherited from the *DataOutput.writeUTF*
> method, where the JDK explicitly defines this limit [2].
>
> For our use case we need to be able to serialize longer UTF strings, so we
> will need to define a *writeLongUTF* method with a specification similar
> to *writeUTF*, but without the length limit.
>
> My questions are:
> - Is this something that would be useful for every Flink user? Shall we add
> this method to *DataOutputSerializer*?
> - Or is it very specific to Iceberg, so that we should keep it in the
> Iceberg connector code?
>
> Thanks,
> Peter
>
> [1] - https://github.com/apache/iceberg/issues/9410
> [2] - https://docs.oracle.com/javase/8/docs/api/java/io/DataOutput.html#writeUTF-java.lang.String-
>
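
As a rough illustration of the proposed extension, a writeLongUTF /
readLongUTF pair could store the byte length as a 4-byte int instead of the
2-byte unsigned short used by writeUTF, followed by standard (not modified)
UTF-8 bytes. The helper class, method names, and encoding below are
assumptions for illustration against the DataOutputView / DataInputView
interfaces, not the API that may ultimately land in DataOutputSerializer:

import java.io.IOException;
import java.nio.charset.StandardCharsets;

import org.apache.flink.core.memory.DataInputView;
import org.apache.flink.core.memory.DataOutputView;

// Hypothetical helpers; the final shape of the API in Flink may differ.
public final class LongUtfUtil {

    private LongUtfUtil() {}

    // Writes the string as a 4-byte length followed by its UTF-8 bytes, so the
    // 64 KB cap of writeUTF does not apply.
    public static void writeLongUTF(DataOutputView out, String value) throws IOException {
        byte[] bytes = value.getBytes(StandardCharsets.UTF_8);
        out.writeInt(bytes.length);
        out.write(bytes);
    }

    // Reads a string previously written by writeLongUTF.
    public static String readLongUTF(DataInputView in) throws IOException {
        int length = in.readInt();
        byte[] bytes = new byte[length];
        in.readFully(bytes);
        return new String(bytes, StandardCharsets.UTF_8);
    }
}

Using standard UTF-8 with an int length keeps reading and writing symmetric
and sidesteps the modified UTF-8 rules of writeUTF, at the cost of not being
byte-compatible with it.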