[ 
https://issues.apache.org/jira/browse/AVRO-3223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tiago Margalho updated AVRO-3223:
---------------------------------
    Description: 
I'd like to propose a change to the C# library to allow it to dynamically 
support codecs that aren't supported out of the box. That is, instead of adding 
more codecs to the library, the library should allow codecs to be resolved by 
the user.  At the moment, only deflate is supported by the library.

In fact, the library already supports "custom" codecs to be used when writing 
to an Avro file. A user can implement a Codec and pass it into the call to 
OpenWriter: 
{code:c#|title=DataFileWriter.cs|borderStyle=solid}
public static IFileWriter<T> OpenWriter(DatumWriter<T> writer, string path, 
Codec codec);
 {code}
However, there's no way a user can call DataFileReader.OpenRead or 
DataFileReader.OpenAppendWriter and specify how the codec should be resolved.

So what I'm proposing is that we allow these other methods to resolve custom 
codecs thus enabling the library to support codecs that are currently supported 
in other languages but not currently on the C# implementation.

Here are some reasons why I believe this would be a good approach to extend 
support to other codecs:
 - The .NET base class library may not support all codecs.
 - The library won't have to depend on other OSS projects with potentially 
conflicting license agreements.
 - The user is allowed to pick which implementation of zstd, snappy, etc is 
best for their use case (eg. may prefer a version compatible to an older 
version of .NET, or one that is more recent and optimised for more recent 
versions of .NET/C#).

h2. How this would work

At present, the library uses a codec resolution method that is only internally 
used:
{code:c#|title=DataFileReader.cs|borderStyle=solid}
        private Codec ResolveCodec()
        {
            return 
Codec.CreateCodecFromString(GetMetaString(DataFileConstants.MetaDataCodec));
        }
{code}
 This can be somewhat easily extended to support other "resolvers" but first we 
would need to allow "resolvers" to be registered to be discovered by the 
DataFileReader:
{code:c#|title=DataFileReader.cs|borderStyle=solid}
        /// <summary>
        /// Represents a function capable of resolving a codec identifying 
string
        /// into a matching codec implementation a reader can use to decompress 
data.
        /// </summary>
        public delegate Codec CodecResolver(string codecMetaString);
        private static List<CodecResolver> _codecResolvers = new 
List<CodecResolver>();

        /// <summary>
        /// Registers a function that will be used to resolve a codec 
identifying string
        /// into a matching codec implementation when reading compressed Avro 
data.
        /// </summary>
        public static void RegisterCodecResolver(CodecResolver resolver)
        {
            _codecResolvers.Add(resolver);
        }
 {code}
Thus, the implementation of `ResolveCodec` could be:
{code:c#|title=DataFileReader.cs|borderStyle=solid}
        private Codec ResolveCodec()
        {
            var codecString = GetMetaString(DataFileConstants.MetaDataCodec);
            
            foreach (var resolver in _codecResolvers)
            {
                var candidateCodec = resolver(codecString);
                if (candidateCodec != null)
                {
                    return candidateCodec;
                }
            }

            return Codec.CreateCodecFromString(codecString);
        }
{code}
h2. One additional change required

The only change left to make is related to how decompression buffers are passed 
around. When I tried adding support for zstd using this approach, I found that 
I couldn't decompress data because the DataFileReader reuses a `DataBlock` 
buffer and the actual compressed data length doesn't necessarily match the 
length of the buffer. Although some compression algos are likely to have 
metadata in frames/headers describing the length of the encoded payload, not 
all implementations will allow the buffer size to be different so I think it 
would make sense to make the buffer size explicit.
 As a consequence I had to make a change that allows us to pass the length of 
the data in the DataBlock into the Codec.Decompress method call:
{code:c#|title=DataFileReader.cs|borderStyle=solid}
if (HasNextBlock())
{
    _currentBlock = NextRawBlock(_currentBlock);
    _currentBlock.Data = _codec.Decompress(_currentBlock.Data, 
(int)this._blockSize);
    ...
{code}
I will be raising a PR shortly with the changes I've described above. Hopefully 
people will find this proposal reasonable but please let me know what you think!

Thanks!

  was:
I'd like to propose a change to the C# library to allow it to dynamically 
support codecs that aren't supported out of the box. That is, instead of adding 
more codecs to the library, the library should allow codecs to be resolved by 
the user.  At the moment, only deflate is supported by the library.

In fact, the library already supports "custom" codecs to be used when writing 
to an Avro file. A user can implement a Codec and pass it into the call to 
OpenWriter: 
{code:c#|title=DataFileWriter.cs|borderStyle=solid}
public static IFileWriter<T> OpenWriter(DatumWriter<T> writer, string path, 
Codec codec);
 {code}
However, there's no way a user can call DataFileReader.OpenRead or 
DataFileReader.OpenAppendWriter and specify how the codec should be resolved.

So what I'm proposing is that we allow these other methods to resolve custom 
codecs thus enabling the library to support codecs that are currently supported 
in other languages but not currently on the C# implementation.

Here are some reasons why I believe this would be a good approach to extend 
support to other codecs:
 - The .NET base class library may not support all codecs.
 - The library won't have to depend on other OSS projects with potentially 
conflicting license agreements.
 - The user is allowed to pick which implementation of zstd, snappy, etc is 
best for their use case (eg. may prefer a version compatible to an older 
version of .NET, or one that is more recent and optimised for more recent 
versions of .NET/C#).

h2. How this would work

At present, the library uses a codec resolution method that is only internally 
used:
{code:c#|title=DataFileReader.cs|borderStyle=solid}
        private Codec ResolveCodec()
        {
            return 
Codec.CreateCodecFromString(GetMetaString(DataFileConstants.MetaDataCodec));
        }
{code}
 This can be somewhat easily extended to support other "resolvers" but first we 
would need to allow "resolvers" to be registered to be discovered by the 
DataFileReader:
{code:c#|title=DataFileReader.cs|borderStyle=solid}
        /// <summary>
        /// Represents a function capable of resolving a codec identifying 
string
        /// into a matching codec implementation a reader can use to decompress 
data.
        /// </summary>
        public delegate Codec CodecResolver(string codecMetaString);
        private static List<CodecResolver> _codecResolvers = new 
List<CodecResolver>();

        /// <summary>
        /// Registers a function that will attempt to resolve a codec 
identifying string
        /// into a matching codec implementation when reading compressed Avro 
data.
        /// </summary>
        public static void RegisterCodecResolver(CodecResolver resolver)
        {
            _codecResolvers.Add(resolver);
        }
 {code}
Thus, the implementation of `ResolveCodec` could be:
{code:c#|title=DataFileReader.cs|borderStyle=solid}
        private Codec ResolveCodec()
        {
            var codecString = GetMetaString(DataFileConstants.MetaDataCodec);
            
            foreach (var resolver in _codecResolvers)
            {
                var candidateCodec = resolver(codecString);
                if (candidateCodec != null)
                {
                    return candidateCodec;
                }
            }

            return Codec.CreateCodecFromString(codecString);
        }
{code}
h2. One additional change required

The only change left to make is related to how decompression buffers are passed 
around. When I tried adding support for zstd using this approach, I found that 
I couldn't decompress data because the DataFileReader reuses a `DataBlock` 
buffer and the actual compressed data length doesn't necessarily match the 
length of the buffer. Although some compression algos are likely to have 
metadata in frames/headers describing the length of the encoded payload, not 
all implementations will allow the buffer size to be different so I think it 
would make sense to make the buffer size explicit.
 As a consequence I had to make a change that allows us to pass the length of 
the data in the DataBlock into the Codec.Decompress method call:
{code:c#|title=DataFileReader.cs|borderStyle=solid}
if (HasNextBlock())
{
    _currentBlock = NextRawBlock(_currentBlock);
    _currentBlock.Data = _codec.Decompress(_currentBlock.Data, 
(int)this._blockSize);
    ...
{code}
I will be raising a PR shortly with the changes I've described above. Hopefully 
people will find this proposal reasonable but please let me know what you think!

Thanks!


> Allow dynamic codec support
> ---------------------------
>
>                 Key: AVRO-3223
>                 URL: https://issues.apache.org/jira/browse/AVRO-3223
>             Project: Apache Avro
>          Issue Type: New Feature
>          Components: csharp
>            Reporter: Tiago Margalho
>            Priority: Minor
>
> I'd like to propose a change to the C# library to allow it to dynamically 
> support codecs that aren't supported out of the box. That is, instead of 
> adding more codecs to the library, the library should allow codecs to be 
> resolved by the user.  At the moment, only deflate is supported by the 
> library.
> In fact, the library already supports "custom" codecs to be used when writing 
> to an Avro file. A user can implement a Codec and pass it into the call to 
> OpenWriter: 
> {code:c#|title=DataFileWriter.cs|borderStyle=solid}
> public static IFileWriter<T> OpenWriter(DatumWriter<T> writer, string path, 
> Codec codec);
>  {code}
> However, there's no way a user can call DataFileReader.OpenRead or 
> DataFileReader.OpenAppendWriter and specify how the codec should be resolved.
> So what I'm proposing is that we allow these other methods to resolve custom 
> codecs thus enabling the library to support codecs that are currently 
> supported in other languages but not currently on the C# implementation.
> Here are some reasons why I believe this would be a good approach to extend 
> support to other codecs:
>  - The .NET base class library may not support all codecs.
>  - The library won't have to depend on other OSS projects with potentially 
> conflicting license agreements.
>  - The user is allowed to pick which implementation of zstd, snappy, etc is 
> best for their use case (eg. may prefer a version compatible to an older 
> version of .NET, or one that is more recent and optimised for more recent 
> versions of .NET/C#).
> h2. How this would work
> At present, the library uses a codec resolution method that is only 
> internally used:
> {code:c#|title=DataFileReader.cs|borderStyle=solid}
>         private Codec ResolveCodec()
>         {
>             return 
> Codec.CreateCodecFromString(GetMetaString(DataFileConstants.MetaDataCodec));
>         }
> {code}
>  This can be somewhat easily extended to support other "resolvers" but first 
> we would need to allow "resolvers" to be registered to be discovered by the 
> DataFileReader:
> {code:c#|title=DataFileReader.cs|borderStyle=solid}
>         /// <summary>
>         /// Represents a function capable of resolving a codec identifying 
> string
>         /// into a matching codec implementation a reader can use to 
> decompress data.
>         /// </summary>
>         public delegate Codec CodecResolver(string codecMetaString);
>         private static List<CodecResolver> _codecResolvers = new 
> List<CodecResolver>();
>         /// <summary>
>         /// Registers a function that will be used to resolve a codec 
> identifying string
>         /// into a matching codec implementation when reading compressed Avro 
> data.
>         /// </summary>
>         public static void RegisterCodecResolver(CodecResolver resolver)
>         {
>             _codecResolvers.Add(resolver);
>         }
>  {code}
> Thus, the implementation of `ResolveCodec` could be:
> {code:c#|title=DataFileReader.cs|borderStyle=solid}
>         private Codec ResolveCodec()
>         {
>             var codecString = GetMetaString(DataFileConstants.MetaDataCodec);
>             
>             foreach (var resolver in _codecResolvers)
>             {
>                 var candidateCodec = resolver(codecString);
>                 if (candidateCodec != null)
>                 {
>                     return candidateCodec;
>                 }
>             }
>             return Codec.CreateCodecFromString(codecString);
>         }
> {code}
> h2. One additional change required
> The only change left to make is related to how decompression buffers are 
> passed around. When I tried adding support for zstd using this approach, I 
> found that I couldn't decompress data because the DataFileReader reuses a 
> `DataBlock` buffer and the actual compressed data length doesn't necessarily 
> match the length of the buffer. Although some compression algos are likely to 
> have metadata in frames/headers describing the length of the encoded payload, 
> not all implementations will allow the buffer size to be different so I think 
> it would make sense to make the buffer size explicit.
>  As a consequence I had to make a change that allows us to pass the length of 
> the data in the DataBlock into the Codec.Decompress method call:
> {code:c#|title=DataFileReader.cs|borderStyle=solid}
> if (HasNextBlock())
> {
>     _currentBlock = NextRawBlock(_currentBlock);
>     _currentBlock.Data = _codec.Decompress(_currentBlock.Data, 
> (int)this._blockSize);
>     ...
> {code}
> I will be raising a PR shortly with the changes I've described above. 
> Hopefully people will find this proposal reasonable but please let me know 
> what you think!
> Thanks!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to