Re: Avro JSON Encoding

2024-04-19 Thread Ryan Skraba
Hello!

A bit tongue in cheek: the one advantage of the current Avro JSON
encoding is that it drives users rapidly to prefer the binary
encoding!  In its current state, Avro isn't really a satisfactory
toolkit for JSON interoperability, while it shines for binary
interoperability. Using JSON with Avro schemas is pretty unwieldy and
a JSON data designer will almost never be entirely satisfied with the
JSON "shape" they can get... today it's useful for testing and
debugging.

That being said, it's hard to argue against improving this experience
where it can help developers who really want to use Avro JSON for
data transfer -- especially for cases like accepting JSON where the
intent is unambiguous, or allowing optional attributes to be missing.
I'd be enthusiastic to see some of these improvements, especially if
we keep the possibility of generating strict Avro JSON for forwards
and backwards compatibility.

My preference would be to avoid adding JSON-specific attributes to the
spec where possible.  Maybe we could consider offering Avro JSON
"variants" through encoder options, or alternative encoders for an
SDK.  There's probably a nice balance to strike between a rigorous and
interoperable (but less customizable) JSON encoding, and trying to
accommodate arbitrary JSON in the Avro project.

All my best and thanks for this analysis -- I'm excited to see where
this leads!  Ryan









On Thu, Apr 18, 2024 at 8:01 PM Oscar Westra van Holthe - Kind
 wrote:
>
> Thank you Clemens,
>
> This is a very detailed set of proposals, and it looks like it would work.
>
> I do, however, feel we'd need to define a way to handle unions with records. Your 
> proposal lists various options, of which the discriminator option seems the most 
> portable to me.
>
> You mention the "displayName" proposal. I don't like that, as it mixes data 
> with UI elements. The discriminator option can specify a fixed or 
> configurable field to hold the type of the record.
>
> Kind regards,
> Oscar
>
>
> --
> Oscar Westra van Holthe - Kind 
>
> Op do 18 apr. 2024 10:12 schreef Clemens Vasters via user 
> :
>>
>> Hi everyone,
>>
>>
>>
>> the current JSON Encoding approach severely limits interoperability with 
>> other JSON serialization frameworks. In my view, the JSON Encoding is only 
>> really useful if it acts as a bridge into and from JSON-centric applications 
>> and it currently gets in its own way.
>>
>>
>>
>> The current encoding being what it is, there should be an alternate mode 
>> that emphasizes interoperability with JSON “as-is” and allows Avro Schema to 
>> describe existing JSON document instances such that I can take someone’s 
>> existing JSON document in on one side of a piece of software and emit Avro 
>> binary on the other side while acting on the same schema.
>>
>>
>>
>> There are four specific issues:
>>
>>
>>
>> Binary Values
>> Unions with Primitive Type Values and Enum Values
>> Unions with Record Values
>> DateTime
>>
>>
>>
>> One by one:
>>
>>
>>
>> 1. Binary values:
>>
>> -
>>
>>
>>
>> Binary values (fixed and bytes) are encoded as escaped Unicode literals. 
>> While I appreciate the creative trick, it costs 6 bytes for each encoded 
>> byte. I have a hard time finding any JSON libraries that provide a 
>> conversion of such strings from/to byte arrays, so this approach appears to 
>> be idiosyncratic for Avro’s JSON Encoding.
>>
>>
>>
>> The common way to encode binary in JSON is to use base64 encoding and that 
>> is widely and well supported in libraries. Base64 is 33% larger than plain 
>> bytes, the encoding chosen here is 500% (!) larger than plain bytes.
>>
>>
>>
>> The Avro decoder is schema-informed and it knows that a field is expected to 
>> hold bytes, so it’s easy to mandate base64 for the field content in the 
>> alternate mode.
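>>
>> As a quick sketch of how little is needed with base64 (shown in Java here,
>> purely as an illustration; most languages have an equivalent in the
>> standard library):
>>
>>   byte[] raw = {(byte) 0xCA, (byte) 0xFE};  // the bytes/fixed field content
>>   String jsonValue = java.util.Base64.getEncoder().encodeToString(raw);  // "yv4="
>>   byte[] decoded = java.util.Base64.getDecoder().decode(jsonValue);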
>>
>>
>>
>> 2. Unions with Primitive Type Values and Enum Values
>>
>> -
>>
>>
>>
>> It’s common to express optionality in Avro Schema by creating a union with 
>> the “null” type, e.g. [“string”, “null”]. The Avro JSON Encoding opts to 
>> encode such unions, like any union, as { “{type}”: {value} } when the value 
>> is non-null.
>>
>>
>>
>> This choice ignores common practice and the fact that JSON’s values are 
>> dynamically typed (RFC8259 Section-3) and inherently accommodate unions. The 
>> conformant way to encode a value choice of null or “string” into a JSON 
>> value is plainly null and “string”.
>>
>>
>>
>> “foo” : null
>>
>> “foo”: “value”
>>
>>
>>
>> The “field default values” table in the Avro spec maps Avro types to the 
>> JSON types null, boolean, integer, number, string, object, and array, all of 
>> which can be encoded into and, more importantly, unambiguously decoded from 
>> a JSON value. The only semi-ambiguous case is integer vs. number, which is a 
>> convention in JSON rather than a distinct type, but any Avro serializer is 
>> guided by type information and can easily make that distinction.
>>
>>
>>
>> 3. 

CVE-2023-39410: Apache Avro Java SDK: Memory when deserializing untrusted data in Avro Java SDK

2023-09-29 Thread Ryan Skraba
Severity: low

Affected versions:

- Apache Avro Java SDK before 1.11.3

Description:

When deserializing untrusted or corrupted data, it is possible for a reader to 
consume memory beyond the allowed constraints and thus lead to out of memory on 
the system.

This issue affects Java applications using Apache Avro Java SDK up to and 
including 1.11.2.  Users should update to apache-avro version 1.11.3 which 
addresses this issue.

This issue is being tracked as AVRO-3819 

Credit:

Adam Korczynski at ADA Logics Ltd (finder)

References:

https://avro.apache.org/
https://www.cve.org/CVERecord?id=CVE-2023-39410
https://issues.apache.org/jira/browse/AVRO-3819



[ANNOUNCE] Apache Avro 1.11.3 released

2023-09-26 Thread Ryan Skraba
The Apache Avro community is pleased to announce the release of Avro 1.11.3!

All signed release artifacts, signatures and verification instructions can
be found here: https://avro.apache.org/releases.html

This is a minor release, specifically addressing known issues with the
1.11.2 release, but also contains version bumps and doc fixes. The
link to all fixed JIRA issues and a brief summary can be found at:
https://github.com/apache/avro/releases/tag/release-1.11.3

In addition, language-specific release artifacts are available:

* C#: https://www.nuget.org/packages/Apache.Avro/1.11.3
* Java: https://repo1.maven.org/maven2/org/apache/avro/avro/1.11.3/
* Javascript: https://www.npmjs.com/package/avro-js/v/1.11.3
* Perl: https://metacpan.org/release/Avro
* Python 3: https://pypi.org/project/avro/1.11.3
* Ruby: https://rubygems.org/gems/avro/versions/1.11.3
* Rust: https://crates.io/crates/apache-avro/0.16.0

Thanks to everyone for contributing!

Ryan


Re: EOS/EOL Date

2023-07-17 Thread Ryan Skraba
Hello!  While Avro doesn't have an official "end-of-life" statement or
policy, there is no active development on the 1.9 or 1.10 branch.

Our current policy is to add major features to the next major release
(1.12.0) while bug fixes, CVEs and minor features will be backported
to the next minor release (1.11.3).

I think we *should* have a policy in place, so that projects that
depend on Avro have better visibility.  I will bring this up on the
d...@avro.apache.org mailing list -- please feel free to join the
discussion!

All my best, Ryan


On Mon, Jul 17, 2023 at 11:19 AM Pranav Kumar (EXT) via user
 wrote:
>
> Hi,
>
>
>
> Could you please share end-of-life/end-of-support details, or any EoS criteria 
> that is followed, for the version below:
>
>
>
> Apache Avro version-1.9.2
>
>
>
> Regards,
>
> Pranav


[ANNOUNCE] Apache Avro 1.11.2 released

2023-07-11 Thread Ryan Skraba
The Apache Avro community is pleased to announce the release of Avro 1.11.2!

All signed release artifacts, signatures and verification instructions can
be found here: https://avro.apache.org/releases.html

This release addresses ~89 Avro JIRA issues, including some interesting highlights:

C#
- AVRO-3434: Support logical schemas in reflect reader and writer
- AVRO-3670: Add NET 7.0 support
- AVRO-3724: Fix C# JsonEncoder for nested array of records
- AVRO-3756: Add a method to return types instead of writing them to disk

C++
- AVRO-3601: C++ API header contains breaking include
- AVRO-3705: C++17 support

Java
- AVRO-2943: Add new GenericData String/Utf8 ARRAY comparison test
- AVRO-2943: improve GenericRecord MAP type comparison
- AVRO-3473: Use ServiceLoader to discover Conversion
- AVRO-3536: Inherit conversions for Union type
- AVRO-3597: Allow custom readers to override string creation
- AVRO-3560: Throw SchemaParseException on dangling content beyond end of schema
- AVRO-3602: Support Map(with non-String keys) and Set in ReflectDatumReader
- AVRO-3676: Produce valid toString() for UUID JSON
- AVRO-3698: SpecificData.getClassName must replace reserved words
- AVRO-3700: Publish Java SBOM artifacts with CycloneDX
- AVRO-3783: Read LONG length for bytes, only allow INT sizes
- AVRO-3706: accept space in folder name

Python
- AVRO-3761: Fix broken validation of nullable UUID field
- AVRO-3229: Raise on invalid enum default only if validation enabled
- AVRO-3622: Fix compatibility check for schemas having or missing namespace
- AVRO-3669: Add py.typed marker file (PEP561 compliance)
- AVRO-3672: Add CI testing for Python 3.11
- AVRO-3680: allow to disable name validation

Ruby
- AVRO-3775: Fix decoded default value of logical type
- AVRO-3697: Test against Ruby 3.2
- AVRO-3722: Eagerly initialize instance variables for better inline cache hits

Rust
- Many, many bug fixes and implementation progress in this experimental SDK.
- Rust CI builds and lints are passing, and the crate has been released to
crates.io as version 0.15.0

In addition:

- Upgrade dependencies to latest versions, including CVE fixes.
- Testing and build improvements.
- Performance fixes, other bug fixes, better documentation and more...

The link to all fixed JIRA issues and a brief summary can be found at:
https://github.com/apache/avro/releases/tag/release-1.11.2

In addition, language-specific release artifacts are available:

* C#: https://www.nuget.org/packages/Apache.Avro/1.11.2
* Java: https://repo1.maven.org/maven2/org/apache/avro/avro/1.11.2/
* Javascript: https://www.npmjs.com/package/avro-js/v/1.11.2
* Perl: https://metacpan.org/release/Avro
* Python 3: https://pypi.org/project/avro/1.11.2
* Ruby: https://rubygems.org/gems/avro/versions/1.11.2
* Rust: https://crates.io/crates/apache-avro/0.15.0

**Important**: a known issue has been discovered after the release that may
affect the Java SDK when using the MAP type.

- AVRO-3789 [Java]: Problem when comparing empty MAP types.

Thanks to everyone for contributing!


Re: Modifying a field's schema property in Java

2022-11-23 Thread Ryan Skraba
Thanks Oscar!

Julien (or anyone else) -- do you think it would be useful to have a
category of "Schema" objects that are mutable for the Java SDK?

Something like:

MutableSchema ms = originalSchema.unlock();
ms.getField("quantity").setProperty("precision", 5);
ms.getField("dept").setFieldName("department_id");
ms.getField("department_id").setType(Schema.Type.LONG);
Schema modifiedSchema = ms.lock();

This would be a major change to the Java SDK, but in the past, we've
used a lot of "ad hoc" or dynamic, transient schemas and making
changes has always been a pain point!
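
For comparison, here's a rough sketch of the rebuild dance that works
today for a flat record (using the Field copy constructor from 1.9+;
nested records need the visitor approach Oscar mentioned, and a rename
or type change means building a brand-new Field instead of a copy):

List<Schema.Field> fields = new ArrayList<>();
for (Schema.Field f : originalSchema.getFields()) {
  Schema.Field copy = new Schema.Field(f, f.schema());  // copies name, doc, default, props
  if (f.name().equals("quantity")) {
    copy.addProp("precision", 5);  // add/override a property on the copy
  }
  fields.add(copy);
}
Schema modifiedSchema = Schema.createRecord(originalSchema.getName(),
    originalSchema.getDoc(), originalSchema.getNamespace(),
    originalSchema.isError(), fields);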

All my best, Ryan

On Sun, Nov 13, 2022 at 8:19 AM Oscar Westra van Holthe - Kind
 wrote:
>
> On sun 13 nov. 2022 05:34, Julien Phalip  wrote:
>>
>> I've got a schema with multiple levels of depths (sub-records) that I would 
>> need to change slightly. [...]
>>
>> Is there a way to make this type of modification on an existing schema, or 
>> do you have to recreate the entire schema from scratch?
>
>
> After creation, Avro schemata are immutable. To make such modifications you 
> can use a visitor. There already is some code available to help you along: 
> you can find an example in the module avro-compiler, that replaces references 
> to named schemata with the actual schema.
>
> IIRC, you're looking for the Schemas class. The interface you need to 
> implement has the word 'visitor' in the name.
>
> Kind regards,
> Oscar
>
> --
> Oscar Westra van Holthe - Kind 


[ANNOUNCE] Apache Avro 1.11.1 released

2022-08-08 Thread Ryan Skraba
The Apache Avro community is pleased to announce the release of Avro 1.11.1!

All signed release artifacts, signatures and verification instructions can
be found here: https://avro.apache.org/releases.html

This release includes ~250 Jira issues. Some interesting highlights:

Avro specification
- [AVRO-3436] Clarify which names are allowed to be qualified with namespaces
- [AVRO-3370] Inconsistent behaviour on types as invalid names
- [AVRO-3275] Clarify how fullnames are created, with example
- [AVRO-3257] IDL: add syntax to create optional fields
- [AVRO-2019] Improve docs for logical type annotation

C++
- [AVRO-2722] Use of boost::mt19937 is not thread safe

C#
- [AVRO-3383] Many completed subtasks for modernizing C# coding style
- [AVRO-3481] Input and output variable type mismatch
- [AVRO-3475] Enforce time-millis and time-micros specification
- [AVRO-3469] Build and test using .NET SDK 7.0
- [AVRO-3468] Default values for logical types not supported
- [AVRO-3467] Use oracle-actions to test with Early Access JDKs
- [AVRO-3453] Avrogen Add Generated Code Attribute
- [AVRO-3432] Add command line option to skip creation of directories
based on namespace path
- [AVRO-3411] Add Visual Studio Code Devcontainer support
- [AVRO-3388] Implement extra codecs for C# as separate nuget packages
- [AVRO-3265] avrogen generates uncompilable code when namespace ends
with ".Avro"
- [AVRO-3219] Support nullable enum type fields

Java
- [AVRO-3531] GenericDatumReader in multithread lead to infinite loop
- [AVRO-3482] Reuse MAGIC in DataFileReader
- [AVRO-3586] Make Avro Build Reproducible
- [AVRO-3441] Automatically register LogicalTypeFactory classes
- [AVRO-3375] Add union branch, array index and map key "path"
information to serialization errors
- [AVRO-3374] Fully qualified type reference "ns.int" loses namespace
- [AVRO-3294] IDL parsing allows doc comments in strange places
- [AVRO-3273] avro-maven-plugin breaks on old versions of Maven
- [AVRO-3266] Output stream incompatible with MagicS3GuardCommitter
- [AVRO-3243] Lock conflicts when using computeIfAbsent
- [AVRO-3120] Support Next Java LTS (Java 17)
- [AVRO-2498] UUID generation is not working

Javascript
- [AVRO-3489] Replace istanbul with nyc for code coverage
- [AVRO-3322] Buffer is not defined in browser environment
- [AVRO-3084] Fix JavaScript interop test to read files generated by
other languages on CI

Perl
- [AVRO-3263] Schema validation warning on invalid schema with a long field

Python
- [AVRO-3542] Scale assignment optimization
- [AVRO-3521] "Scale" property from decimal object
- [AVRO-3380] Byte reading in avro.io does not assert read bytes to
requested nbytes
- [AVRO-3229] validate the default value of an enum field
- [AVRO-3218] Pass LogicalType to BytesDecimalSchema

Ruby
- [AVRO-3277] Test against Ruby 3.1

Rust
- [AVRO-3558] Add a demo crate that shows usage as WebAssembly
- [AVRO-3526] Improve resolving Bytes and Fixed from string
- [AVRO-3506] Implement Single Object Writer
- [AVRO-3507] Implement Single Object Reader
- [AVRO-3405] Add API for user-provided metadata to file
- [AVRO-3339] Rename crate from avro-rs to apache-avro
- [AVRO-3479] Derive Avro Schema macro

Website
- [AVRO-2175] Website refactor
- [AVRO-3450] Document IDL support in IDEs

This is the first release that provides the Rust apache-avro crate at crates.io!

And of course upgraded dependencies to latest versions, CVE fixes and more
https://issues.apache.org/jira/issues/?jql=project%20%3D%20AVRO%20AND%20fixVersion%20%3D%201.11.1

The link to all fixed JIRA issues and a brief summary can be found at:
https://github.com/apache/avro/releases/tag/release-1.11.1

In addition, language-specific release artifacts are available:

* C#: https://www.nuget.org/packages/Apache.Avro/1.11.1
* Java: from Maven Central,
* Javascript: https://www.npmjs.com/package/avro-js/v/1.11.1
* Perl: https://metacpan.org/release/Avro
* Python 3: https://pypi.org/project/avro/1.11.1/
* Ruby: https://rubygems.org/gems/avro/versions/1.11.1
* Rust: https://crates.io/crates/apache-avro/0.14.0

Thanks to everyone for contributing!


Re: Reflection Based Serializer on an Interface

2022-02-15 Thread Ryan Skraba
Hello!  I was hoping someone had better news, but I'm afraid there are a
couple of constraints when using interfaces with ReflectData.

My recommendation would be to create a Schema from your actual
concrete implementation, and drop it onto your interface with an
@AvroSchema annotation.  It's not necessarily the best solution,
because the name of the schema will (and must) be the name of the
concrete implementation.

I put an example here:
https://github.com/RyanSkraba/avro-enchiridion/blob/c1951937661390ca6365033aaae12d2c9e8a6a20/core/src/test/java/com/skraba/avro/enchiridion/core/ReflectDataTest.java#L110

In that test case, the Issue interface is annotated with the schema
that you would get from ReflectData.get().getSchema(IssueImpl.class),
and you would *have* to keep the two in sync.
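
Roughly, the annotated interface ends up looking something like this
(the names and the field list below are just placeholders -- the real
schema string is whatever ReflectData produces for the concrete class):

import org.apache.avro.reflect.AvroSchema;

@AvroSchema("{\"type\": \"record\", \"name\": \"IssueImpl\", "
    + "\"namespace\": \"com.example\", "
    + "\"fields\": [{\"name\": \"description\", \"type\": \"string\"}]}")
public interface Issue {
  String getDescription();
}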

I hope this is helpful, Ryan

On Fri, Jan 28, 2022 at 1:41 AM Swamy Thota  wrote:
>
> Hi All,
>
> I have a POJO model with an interface which doesn't have any fields 
> defined, but the actual implementation determined at runtime has several 
> fields defined. When I generate the schema using reflection it doesn't take 
> into account the actual implementation. When I try to serialize the POJO 
> it throws an exception saying it doesn't know specific fields in the 
> implementation. How do I work around this? Is there a way to hint the 
> serializer to use the specific implementation that contains the fields?
>
> Thanks,
> Swamy


Re: Gigantic list of Avro spec issues

2022-02-14 Thread Ryan Skraba
Hello!

Thanks, Dan, for your calm and measured response -- you've given some
excellent advice on how someone can make a positive contribution to
the project and the spec.

Askar, your approach in presenting your specification review should
have been more constructive: it isn't useful to cross-post across
lists and recipients indiscriminately, and please remain civil when
pointing out (potential) errors.  I understand that the Avro
specification isn't perfect, but I think the community has always been
open to making changes and providing clarifications.

I'm glad that you've found some things to appreciate in the Avro
project, and I hope you choose to contribute in making the
improvements that you'd like to see.

Ryan

[1] https://www.apache.org/foundation/policies/conduct

On Sun, Feb 13, 2022 at 6:05 PM Dan Schmitt  wrote:
>
> I will admit the spec is likely weak around unicode/utf encoding,
> (as the serialization of strings to json isn't consistent across all
> language bindings) but I file the ticket against implementations
> and write test cases rather than make guesses at the spec
> wording and make demands without knowing it's a real issue.
>
> On Sat, Feb 12, 2022 at 6:55 PM Dan Schmitt  wrote:
> >
> > Generally the "I'm going to lump all my complaints into
> > one big bug" is a good way to get them ignored.
> >
> > I'll skip "the design is wrong and it should change because
> > I don't like it" and cite "it's used everywhere with lots of
> > implementations so you can't change it in an incompatible way".
> >
> > I'll skip obvious typos and suggest you catch more flies with
> > honey than vinegar, and you can work a git repository and
> > make a pull request for those and they'd be fixed fast.  (Or
> > read 2 or 3 of the 8 implementations if the phrasing is confusing
> > to you.)
> >
> > I'll make some suggestions on the technical details
> > So, specifically:
> >
> > * [Opinion] [Line 737]. "The 16-byte, randomly-generated sync marker
> > for this file". I don't like this point. It
> > implies that container files are usually not equal. Thus it is not
> > possible to compare them bitwise to determine
> > equality. So, in my Avro implementation I write null bytes instead of
> > this marker (yes, this possibly means that my
> > implementation is non-conforming)
> > * [Opinion] [Line 717]. There is no any marker for end of container
> > file. Thus there is no way to determine whether all
> > data was written
> >
> > If you use the sync marker, you don't need an end of container marker.
> > (Flush the sync and container block map with new data after the new
> > block is written, if you have the metadata and block list you know that
> > much is complete and written for you to read, if you read the metadata
> > and your sync byte marker is wrong, go re-read/continue read.)
> >
> > * [Very theoretical bug, possible even security-related] [Line 435].
> > Since you have a test case that proves it doesn't crash systems, it's
> > sort of not a bug, right?  You could add the test case to the test suite.
> >
> > * [Bug] [Line 572]. "Currently for C/C++ implementations, the
> > positions are practically an int, but theoretically a
> > long". Wat? So, other implementations use int (as per spec), but C++
> > uses long, right? So, go fix C++ implementation to
> > match spec and other implementations
> >
> > This is not a bug, but an acknowledgement that the C/C++ offsets
> > are internally implemented via pointer math to be efficient but if
> > you try to read in enough data that a long offset makes sense,
> > you will be sad/run out of memory.  The internal implementation
> > for C/C++ supports the minimum required by the specification.
> >
> > * [Bug] [Line 385]. "For example, int and long are always serialized
> > the same way". What this means? You probably mean
> > that *same* int and long (i. e. int and long, which are numerically
> > identical) serialized the same way.
> >
> > That rewrite is wrong.  Your wording would allow serialization to be
> > altered by value (e.g. it would be allowable to use big endian storage
> > for odd numbers and little endian for even as each same int and long
> > would be serialized the same way.)
> >
> > * [Opinion] [Line 596]. "if its type is null, then it is encoded as a
> > JSON null". There is no reasons to special-case
> > nulls. This is additional requirement, which adds complexity to
> > implementations without any reasons
> >
> > You are making assumptions about implementation encoding of null.
> > A C developer would say writing 0x00 to the file that you will read back
> > later is fine for null or false or 0.
> >
> > * [Bug] [Line 417]. - can't take action, the request breaks compatibility.
> >
> > * [Bug] [Line 292]. "The null namespace may not be used in a
> > dot-separated sequence of names". You defined previously
> > null namespace as a empty string instead of *whole* namespace. I. e.
> > null namespace is lack of namespace (i. e. lack of
> > whole 

Re: Avro Maven Plugin for Shared Avro library

2022-01-24 Thread Ryan Skraba
Hello,

As you note, it currently isn't possible to "pre-shade" Avro, but I
completely understand why you might want to do it!  Shading Avro is a
common thing to do (see
https://beam.apache.org/documentation/io/built-in/parquet/ for
example).

I guess we _might_ be able to fiddle with the maven shade plugin to do
the relocating of these classes after compiling the generated classes
(without including the relocated classes in the jar?), but I prefer
your solution of a configuration option passed to the template.  It's
more direct and can be done in a single step.

There's a contribution guide[1], but it boils down to creating a JIRA
and making a PR on github!  Don't hesitate to reach out, and thanks
for the suggestion!

Ryan

[1]: https://cwiki.apache.org/confluence/display/AVRO/How+To+Contribute



On Fri, Jan 21, 2022 at 12:36 PM Enrico Olivelli  wrote:
>
> Hello,
> in the Pulsar project we use a shaded version of Avro, that is, we
> package Avro in a uber Java for the Pulsar client by renaming all the
> classes from org.apache.avro to
> org.apache.pulsar.shaded.org.apache.avro.
>
> For this reason users of the Pulsar client cannot generate the Avro
> classes using the Maven plugin because the generated classes expect
> Avro classes with the canonical package name.
>
> For the sake of completeness: you could use a version of the Pulsar
> client without the shaded version of Avro , but that's not always the
> case because in that case the Pulsar client does not shade all of the
> other third party dependencies and it often clashes with other third
> party libraries.
>
> My question is:
> Is it possible to tell the Maven plugin to use a different package
> name for Avro ?
>
> My understanding is that we are using a fixed Velocity template
> https://github.com/apache/avro/blob/44737386cb17a359515f068e7fe9caa0f7bfff70/lang/java/compiler/src/main/velocity/org/apache/avro/compiler/specific/templates/java/classic/record.vm#L22
>
> I would be happy to contribute a patch following your guide if there
> is no way to achieve that without code changes.
>
> Best regards
> Enrico Olivelli


Re: UUID Logical type not working in Java

2022-01-18 Thread Ryan Skraba
Hello!

This is a known issue in 1.11.0 and before, and has recently been
fixed.   There's some discussion at AVRO-2498.  The bad news is that
if you're using avro-tools to generate your code, there isn't any
released version yet that contains the fix.  Martin sent the link to
the SNAPSHOT repository in the meantime.

The good news is that if you are using the Avro maven plugin, there is
a workaround, described in AVRO-2548.  You can manually add the
missing conversion to the avro plugin configuration as shown in that
JIRA.

I hope this helps, let us know if it works for you!

All my best, Ryan

[AVRO-2498]: https://issues.apache.org/jira/browse/AVRO-2498
[AVRO-2548]: https://issues.apache.org/jira/browse/AVRO-2548

On Sat, Jan 15, 2022 at 1:54 AM Swamy Thota  wrote:
>
> Thanks Martin for the quick response, I will give it a try and let you know. 
> I tried with 1.11.0 earlier.
>
> On Fri, 14 Jan 2022, 3:49 pm Martin Grigorov,  wrote:
>>
>> Hi,
>>
>> Which version of Avro do you use ?
>> Which language ?
>> If Java, then please try 1.12.0-SNAPSHOT from 
>> https://repository.apache.org/content/groups/snapshots/
>> If it still does not work then please show us your schema or even better 
>> create a reproducer and share it with us, e.g. at Github.
>>
>> On Fri, Jan 14, 2022 at 10:22 PM Swamy Thota  wrote:
>>>
>>> Hello All,
>>>
>>> I'm trying to use the uuid logical type in one of my Avro schemas with string as 
>>> the Avro type. The generated class has the field type as string but I'm 
>>> expecting it to be UUID -- is this a known issue? It works for other logical 
>>> types, local-timestamp-millis as an example. Appreciate your help in advance.
>>>
>>> Thanks,
>>> Swamy Thota


Re: Papers discussing Apache Avro

2022-01-14 Thread Ryan Skraba
This is really cool news -- it's always really interesting to see
benchmark studies and the trade-offs we make while choosing different
formats.  Thanks for sharing!

I'd love to see links to some curated articles and papers on the
website!  I created AVRO-3308 if you don't object :D

Ryan

On Fri, Jan 14, 2022 at 10:49 AM Martin Grigorov  wrote:
>
> Hi Juan,
>
> Thank you for sharing your work with us!
>
> It comes right in time for me!
> I am working on the interop tests for the new Rust module and it seems there 
> is some problem reading the .avro files generated by Java. I may need to dive 
> into the binary diffs.
>
> Regards,
> Martin
>
> On Thu, Jan 13, 2022 at 11:14 PM Juan Cruz Viotti  wrote:
>>
>> Hey there!
>>
>> As part of my MSc dissertation at University of Oxford, I wrote and
>> published two papers covering the characteristics of various binary
>> serialization formats, including Apache Avro and performing a
>> space-efficiency benchmark, respectively.
>>
>> Sharing them here in case anybody finds them interesting! The first
>> paper explains how Apache Avro works including an annotated hexadecimal
>> example and the second compares Apache Avro to various other popular
>> serialization formats.
>>
>> - A Survey of JSON-compatible Binary Serialization Specifications:
>>   https://arxiv.org/abs/2201.02089
>> - A Benchmark of JSON-compatible Binary Serialization Specifications:
>>   https://arxiv.org/abs/2201.03051
>>
>> The benchmark study has proved Apache Avro to be one of the most
>> space-efficient formats considered.
>>
>> All the best!
>>
>> --
>> Juan Cruz Viotti
>> Technical Lead @ Postman.com
>> https://www.jviotti.com


Re: New website

2021-12-12 Thread Ryan Skraba
Hello!

I realized that I haven't commented on this mailing list thread -- I
made some comments on https://issues.apache.org/jira/browse/AVRO-2175

This looks amazing and we should merge it very soon :D  It's not
perfect, but it's really a great improvement and definitely not worse
than the existing website!

I've been taking a look at what we need to use the existing
infrastructure, and there's interesting links at:

- https://infra.apache.org/release-download-pages.html
- https://cwiki.apache.org/confluence/display/INFRA/Git+-+.asf.yaml+features
- https://infra.apache.org/website-guidelines.html
- https://infra.apache.org/project-site.html

I like that Beam has the website in the main repo, but notably the
INFRA recommendation is that we use a separate repo for the website
(named `avro-site` though).  Any thoughts?  It's always something we
can try and change later!  Would it make it easier for the javadoc and
other languages if they were in the same repo, or does it make little
difference?

The old site actually contains all of the documentation for EVERY
release, which can be found here:

- https://svn.apache.org/repos/asf/avro/site/publish/docs/
- https://avro.apache.org/docs/

Would it be tricky to adjust your work to mirror the existing
structure for existing docs?  I'm not even too fussy about not
breaking the links in /docs/current/ but all of the existing pages
such as /docs/1.7.7/ should be maintained if possible!

There's so many good suggestions here for future work and improving
our message and communication, I created
https://issues.apache.org/jira/browse/AVRO-3264 to point to this
discussion after we get this up.

Thanks again for your great work!

Ryan

On Thu, Nov 4, 2021 at 4:26 PM Lee Hambley  wrote:
>
> I speak only for myself, but I am working in an environment where I am 
> regularly checking docs all the way back to 1.8.x because we have legacy 
> systems we cannot upgrade, and I am often referencing rules about schema 
> canonical form. I value a lot the sidebar bottom version switching navigation 
> from sites such as here 
> https://fastavro.readthedocs.io/en/latest/writer.html#using-the-record-hint-to-specify-which-branch-of-a-union-to-take
>  ... but I know it can be extraordinarily difficult to make it work correctly 
> with these static site generators.
>
> Lee Hambley
> http://lee.hambley.name/
> +49 (0) 170 298 5667
>
>
> On Thu, 4 Nov 2021 at 16:23, Martin Grigorov  wrote:
>>
>>
>>
>> On Thu, Nov 4, 2021 at 5:04 PM Ismaël Mejía  wrote:
>>>
>>> Wow this is pretty neat ! Nice job Martin! A modern website can
>>> encourage more contributions.
>>> I am more interested in content than aesthetics first. Is everything
>>> already migrated? Anything missing? Any issue to report?
>>
>>
>> Everything is migrated for the documentation of the *current* version.
>> The old site contains documentation for both current and current-1. Is this 
>> something you would like to preserve ?
>>
>>>
>>>
>>>
>>> On Tue, Nov 2, 2021 at 7:01 PM Martin Grigorov  wrote:
>>> >
>>> > Hi,
>>> >
>>> > Anyone willing to send a PR with the suggested improvement?
>>> > Or at least open an issue with the well formulated text and I will add it!
>>> >
>>> > Regards,
>>> > Martin
>>> >
>>> > On Tue, Nov 2, 2021, 18:08 Oscar Westra van Holthe - Kind 
>>> >  wrote:
>>> >>
>>> >> Hi,
>>> >>
>>> >> This is a huge improvement. Responsive, excellent navigation, syntax
>>> >> highlighting, ...
>>> >>
>>> >> The only downside I see was already mentioned by Lee: the landing page is
>>> >> too empty (also in a mobile browser).
>>> >> I think we could really benefit from mentioning the unique selling point 
>>> >> of
>>> >> Avro here: "Your Data. Any Time, Anywhere." And then mention the language
>>> >> availability & excellent schema evolution.
>>> >>
>>> >> Kind regards,
>>> >> Oscar
>>> >>
>>> >> On Thu, 28 Oct 2021 at 10:43, Martin Grigorov  
>>> >> wrote:
>>> >>
>>> >> > Hi all,
>>> >> >
>>> >> > Please check the new candidate for Apache Avro website:
>>> >> > https://avro-website.netlify.app/
>>> >> >
>>> >> > It is based on Hugo and uses Docsy theme.
>>> >> > Its source code and instructions how to build could be found at
>>> >> > https://github.com/martin-g/avro-website.
>>> >> > The JIRA ticket is: https://issues.apache.org/jira/browse/AVRO-2175
>>> >> >
>>> >> > I am not a web designer, so some things may not look finished.
>>> >> > I've just copied the HTML content from the old site (
>>> >> > https://avro.apache.org/) and converted it to Markdown for Hugo.
>>> >> >
>>> >> > Any feedback is welcome! With Pull Requests would be awesome!
>>> >> >
>>> >> > Regards,
>>> >> > Martin
>>> >> >


[ANNOUNCE] Apache Avro 1.11.0 released

2021-10-31 Thread Ryan Skraba
The Apache Avro community is pleased to announce the release of Avro 1.11.0!

All signed release artifacts, signatures and verification instructions can
be found here: https://avro.apache.org/releases.html

This release includes 120 Jira issues, including some interesting features:

Specification: AVRO-3212 Support documentation tags for FIXED types
C#: AVRO-2961 Support dotnet framework 5.0
C#: AVRO-3225 Prevent memory errors when deserializing untrusted data
C++: AVRO-2923 Logical type corrections
Java: AVRO-2863 Support Avro core on android
Javascript: AVRO-3131 Drop support for node.js 10
Perl: AVRO-3190 Fix error when reading from EOF
Python: AVRO-2906 Improved performance validating deep record data
Python: AVRO-2914 Drop Python 2 support
Python: AVRO-3004 Drop Python 3.5 support
Ruby: AVRO-3108 Drop Ruby 2.5 support

For the first time, the 1.11.0 release includes experimental support for
Rust. Work is continuing on this donated SDK, but we have not versioned and
published official artifacts for this release.

Python: The avro package fully supports Python 3. We will no longer publish a
separate avro-python3 package

And of course upgraded dependencies to latest versions, CVE fixes and more:
https://issues.apache.org/jira/issues/?jql=project%3DAVRO%20AND%20fixVersion%3D1.11.0

The link to all fixed JIRA issues and a brief summary can be found at:
https://github.com/apache/avro/releases/tag/release-1.11.0

In addition, language-specific release artifacts are available:

* C#: https://www.nuget.org/packages/Apache.Avro/1.11.0
* Java: from Maven Central,
* Javascript: https://www.npmjs.com/package/avro-js/v/1.11.0
* Perl: https://metacpan.org/release/Avro
* Python 3: https://pypi.org/project/avro/1.11.0
* Ruby: https://rubygems.org/gems/avro/versions/1.11.0

Thanks to everyone for contributing!


Re: android support in avro java libraries

2021-08-12 Thread Ryan Skraba
Thanks for the reference material!  I linked the JIRA to this conversation too.

If AVRO-2863 has listed all the necessary changes (some minor code
changes in core, filtering out the .reflect package and removing the
use of `Thread.withInitial`) then I don't really see any objection to
doing option 1 as a quick win, creating an `avro-android-1.11.0.jar`
artifact for the next release.  Have you already done some of this
work internally?  I asked the original author if he'd be willing to
share his experiments -- with a bit of helpful expertise, I don't see
why this wouldn't be doable for the next major release.

Ryan

On Wed, Aug 11, 2021 at 5:49 PM David Gang  wrote:
>
> Hi,
>
> If needed I could give here more detailed guidance.
> BR,
>
> On Wed, Aug 11, 2021 at 5:52 PM David Gang  wrote:
>>
>> Hi,
>>
>> Thanks for the quick response. I think that the jira issue AVRO-2863 
>> summarizes the issues.
>> When talking about android support it is also important to decide which 
>> android version you want to support. This is normally based on distribution 
>> statistics: https://www.appbrain.com/stats/top-android-sdk-versions
>> There are two main problems: classes which won't be implemented by the Android 
>> platform (like ClassValue), and APIs which are only implemented in later 
>> versions.
>>
>> So when deciding that android is a platform which should be supported there 
>> are two options:
>>
>> 1. Not using classes which are not part of the android runtime (There are 
>> very few)
>> 2. Create two flavors of the library. For example guava have guava android 
>> and guava jre.
>>
>> The first option is easier to handle but I am not sure what the impact on 
>> the product will be. The second option would (maybe) be better for performance 
>> but would be more complicated to handle.
>>
>> Besides this, it is also important to decide what minimal Android version a 
>> given Avro library version supports.
>>
>> Regarding how to catch this stuff, this is a hard question. The only idea I 
>> have is to introduce a regression test which should run in your CI system. 
>> Mockito had the same problems and this is the solution they chose: 
>> https://github.com/mockito/mockito/issues/2341 The regression test could be 
>> run on emulators with different versions (or maybe Robolectric) and so 
>> errors could be caught.
>>
>> Hope this answers roughly the questions.
>>
>> Thanks,
>>
>>
>>
>> On Wed, Aug 11, 2021 at 5:28 PM Ryan Skraba  wrote:
>>>
>>> Hello!  I don't think it was a conscious choice in 1.9+ to "drift"
>>> away from android platform compatibility, and it's certainly worth the
>>> effort to make Avro usable for android developers.
>>>
>>> What would it take to bring the Java code back into a compatible
>>> state?  Would we need to separate out some of the core functionality?
>>> Are there tools for detecting android incompatibility that we could
>>> put into the build process?  I'm not an android developer in the
>>> slightest so any guidance or contribution would be helpful.
>>>
>>> Ryan
>>>
>>> On Wed, Aug 11, 2021 at 4:14 PM David Gang  wrote:
>>> >
>>> > Hi,
>>> >
>>> > Does the developer team want to support the android platform?
>>> >
>>> > We are evaluating to use this library on android and got the exception of 
>>> > jira issue: https://issues.apache.org/jira/browse/AVRO-2863.
>>> >
>>> > Currently version 1.8.2 satisfies all our requirements but we want to 
>>> > know what the general attitude towards the android platform is.
>>> >
>>> >
>>> > Thanks


Re: Issue with ReflectDatumWriter With Enums

2021-06-30 Thread Ryan Skraba
Hello!  I'm pretty sure that I've used enums with implementations and
ReflectData successfully, even with old versions of Avro.

It seems to work with 1.9.x+ with the following ReflectDatumWriter
(where datum is an instance of the TestEnum):

  ByteArrayOutputStream baos = new ByteArrayOutputStream();
  Encoder encoder = EncoderFactory.get().binaryEncoder(baos, null);
  DatumWriter<TestEnum> w = new
      ReflectDatumWriter<>(ReflectData.get().getSchema(TestEnum.class));
  w.write(datum, encoder);
  encoder.flush();

Do you have any extra detail about how you're constructing the
ReflectDatumWriter or getting the schema for the TestEnum?

On Fri, Jun 25, 2021 at 9:39 AM Swamy Thota  wrote:
>
> Hi All,
>
> I’m seeing an issue with ReflectDatumWriter when the enum implements methods 
> as below:
>
> enum TestEnum {
>     V {
>         @Override
>         public boolean is_V() {
>             return true;
>         }
>     },
>
>     K {
>         @Override
>         public boolean is_K() {
>             return true;
>         }
>     };
>
>     public boolean is_V() {
>         return false;
>     }
>
>     public boolean is_K() {
>         return false;
>     }
> }
>
> This type of enum fails with SchemaParseException: Empty name. Is 
> there any workaround or fix available? Appreciate the help in advance.
>
> Thanks,
> Swamy


Re: Setting a null value to field with default value

2021-03-24 Thread Ryan Skraba
Hello!  I can reproduce it in Avro 1.10.2, and I think this is a bug.
I raised https://issues.apache.org/jira/browse/AVRO-3091 to track it.
Thanks so much for the full example!

It looks like the workaround is to validate the defId in your own code
before building.  Until this is fixed, the newBuilder() won't catch
these errors (and I think it should).
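
Something as small as this in the calling code would do (where defId is
whatever value you're about to set; java.util.Objects is the only extra
import needed):

DefRecord defRecord = DefRecord.newBuilder()
    .setProviderId("providerId")
    .setCount(1)
    .setDefId(Objects.requireNonNull(defId, "defId must not be null"))
    .build();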

Best regards, Ryan

On Tue, Mar 23, 2021 at 11:17 PM KV 59  wrote:
>
> Hi,
> I have a schema defined as below
>
>> {
>>   "type" : "record",
>>   "name" : "DefRecord",
>>   "namespace" : "com.test.ns1",
>>   "doc" : "Message",
>>   "fields" : [ {
>> "name" : "providerId",
>> "type" : {
>>   "type" : "string",
>>   "avro.java.string" : "String"
>> }
>>   }, {
>> "name" : "defId",
>> "type" : {
>>   "type" : "string",
>>   "avro.java.string" : "String"
>> },
>> "default" : ""
>>   }, {
>> "name" : "text",
>> "type" : [ "null", {
>>   "type" : "string",
>>   "avro.java.string" : "String"
>> } ],
>> "doc" : "the text",
>> "default" : null
>>   }, {
>> "name" : "count",
>> "type" : "int",
>> "doc" : "Number of segments in message",
>> "default" : 0
>>   }, {
>> "name" : "defBytes",
>> "type" : "bytes",
>> "default" : "ÿ"
>>   } ]
>> }
>
>
> I generated the classes using Java and using the builder tries to create an 
> object
>
> DefRecord defRecord = DefRecord.newBuilder()
> .setProviderId("providerId")
> .setCount(1)
> .setDefId(null)
> .build();
>
> The build runs successfully. (I expect this to fail as the field defId is not 
> nullable)
>
> When I serialize this object it throws a NullPointerException
>
> What I want to know is why the build is successful in the first place. 
> Shouldn't the build fail, as we are trying to set a null value to a 
> non-nullable field? Is there a reason for this behavior?
>
> Thanks
>


[ANNOUNCE] Apache Avro 1.10.2 released

2021-03-17 Thread Ryan Skraba
The Apache Avro community is pleased to announce the release of Avro 1.10.2!

All signed release artifacts, signatures and verification instructions can
be found here: https://avro.apache.org/releases.html

This release includes 31 Jira issues, including some interesting features:

C#: AVRO-3005 Support for large strings
C++: AVRO-3031 Fix for reserved keywords in generated code
Java: AVRO-2471 Fix for timestamp-micros in generated code
Java: AVRO-3060 Support ZSTD level and bufferpool options
Ruby: AVRO-2998 Records with symbol keys validation
Ruby: AVRO-3023 Validate with Ruby 3

Migration notes:
Python: AVRO-2656 The standard avro package supports Python 3, and
the avro-python3 package is in the process of being deprecated.

And of course upgraded dependencies to latest versions, CVE fixes and more:
https://issues.apache.org/jira/issues/?jql=project%20%3D%20AVRO%20AND%20fixVersion%20%3D%201.10.2

The link to all fixed JIRA issues and a brief summary can be found at:
https://github.com/apache/avro/releases/tag/release-1.10.2

In addition, language-specific release artifacts are available:

* C#: https://www.nuget.org/packages/Apache.Avro/1.10.2
* Java: from Maven Central,
* Javascript: https://www.npmjs.com/package/avro-js/v/1.10.2
* Perl: https://metacpan.org/release/Avro
* Python 3: https://pypi.org/project/avro/1.10.2/
* Ruby: https://rubygems.org/gems/avro/versions/1.10.2

Thanks to everyone for contributing!


Re: Avro Java - Validation of GenericRecord question

2021-01-05 Thread Ryan Skraba
Hello!

As you noticed, the validate method deliberately ignores the actual schema
of a record datum, and validates the field values by position.  It's
answering a slightly different question --> whether the datum (and its
contents) could fit in the given schema.

For your use case, you might want to use the rules for schema compatibility:

SchemaCompatibility.SchemaPairCompatibility compatibility =
    SchemaCompatibility.checkReaderWriterCompatibility(userv1.getSchema(),
        v2Schema);
assertThat(compatibility.getType(),
    is(SchemaCompatibility.SchemaCompatibilityType.INCOMPATIBLE));

In your test, the built-in Avro schema resolution can't be used to convert
the userv1 datum to the v2Schema, so it reports INCOMPATIBLE.

If the V2 change were non-breaking (like adding a field with a default),
then the schemas would still be reported COMPATIBLE with that method.
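
For example, here's a rough sketch reusing the SchemaBuilder style from your
test (the "country" field and its default are made up for illustration):

Schema v3Schema = SchemaBuilder.record("User").fields()
   .requiredString("name")
   .requiredInt("age")
   .name("country").type().stringType().stringDefault("unknown")
   .endRecord();

// Reading v1 data with v3 is fine: the new field just takes its default.
assertThat(
   SchemaCompatibility.checkReaderWriterCompatibility(v3Schema, v1Schema).getType(),
   is(SchemaCompatibility.SchemaCompatibilityType.COMPATIBLE));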

Of course, if you just want to enforce that incoming records are strictly
and only the reference schema, you could simply check the two for equality:

user.getSchema().equals(v2Schema)

Is this what you're looking for?  I'm not familiar enough with records
produced using the Confluent Schema Registry!  I'm surprised this isn't
available in Kafka message metadata, you might want to check into their
implementation.

All my best, Ryan



On Tue, Jan 5, 2021 at 2:44 PM laurent broudoux 
wrote:

> Hello,
>
> I need to validate that a GenericRecord (read from a Kafka Topic) is valid
> regarding an Avro Schema. This reference schema
> is not necessarily the one used for Kafka message deserialization as this
> one was acquired through a Schema Registry.
>
> I had a look at GenericData.get().validate(schema, datum) but it does not
> behave as expected because it does not seem
> to validate record field names but only positions.
>
> Here's below a test case that represents the weird behaviour I am
> observing. I have used Avro 1.10.0 and 1.10.1 and both
> versions behave the same:
>
> @Test
> public void testGenericDataValidate() {
>Schema v1Schema = SchemaBuilder.record("User").fields()
>  .requiredString("name")
>  .requiredInt("age")
>  .endRecord();
>Schema v2Schema = SchemaBuilder.record("User").fields()
>  .requiredString("fullName")
>  .requiredInt("age")
>  .endRecord();
>
>GenericRecord userv1 = new GenericData.Record(v1Schema);
>userv1.put("name", "Laurent");
>userv1.put("age", 42);
>
>// The validate method succeeds because it does not validate the field
> name just the position... So the test fails.
>assertFalse(GenericData.get().validate(v2Schema, userv1));
> }
>
> This test corresponds to a real life scenario I want to detect : Kafka
> producer is still sending messages using the v1 schema but
> we expect records following v2 schema that introduced breaking change
> (field rename).
>
> Is it a known / desired limitation of the validate() method of GenericData
> ? Is there another way of achieving what I want to check ?
>
> Thanks!
>
>
>
>


[ANNOUNCE] Apache Avro 1.10.1 released

2020-12-04 Thread Ryan Skraba
Please note: I mistakenly sent this same message earlier today with the
wrong subject!  It is, in fact, 1.10.1 that was released.  My apologies!

The Apache Avro community is pleased to announce the release of Avro 1.10.1!

All signed release artifacts, signatures and verification instructions can
be
found here: https://avro.apache.org/releases.html

This release includes 33 Jira issues, including some interesting features:

C#: AVRO-2750 Support for enum defaults
C++: AVRO-2891 Expose last sync offset written on DataFileWriter
Java: AVRO-2924 SpecificCompiler add 'LocalDateTime' logical type
Java: AVRO-2937 Expose some missing flags in SpecificCompilerTool
PHP: AVRO-2096 Fixes to missing functions
Ruby: AVRO-2907 Ruby schema.single_object_schema_fingerprint is reversed

Migration notes:
Java: AVRO-2817 Turn off validateDefaults when reading legacy Avro files
Python: AVRO-2656 The avro package is now the preferred Python 3 library,
  and avro-python3 is being prepared for deprecation

And of course upgraded dependencies to latest versions, CVE fixes and more:
https://issues.apache.org/jira/issues/?jql=project%20%3D%20AVRO%20AND%20fixVersion%20%3D%201.10.1

The link to all fixed JIRA issues and a brief summary can be found at:
https://github.com/apache/avro/releases/tag/release-1.10.1

In addition, language-specific release artifacts are available:

* C#: https://www.nuget.org/packages/Apache.Avro/1.10.1
* Java: from Maven Central,
* Javascript: https://www.npmjs.com/package/avro-js/v/1.10.1
* Python 2: https://pypi.org/project/avro/1.10.1/
* Python 3: https://pypi.org/project/avro-python3/1.10.1/
* Ruby: https://rubygems.org/gems/avro/versions/1.10.1

Thanks to everyone for contributing!


Re: Failure in writing BigDecimal as decimal logical type

2020-10-07 Thread Ryan Skraba
Hmmm, I guess I was fuzzy on the details. I thought the SpecificData model
understood the standard logical types even when it was used on "generic"
data, which is apparently not the case!  I guess these conversions are
*only* built-in automatically when you use generated code.

For the reasoning, keep in mind the schema and binary representation didn't
really change between 1.7.x and 1.8.x!  A producer and consumer of Avro
binary data could be using two different versions of Avro and still
communicate successfully (1.7.x ignoring the logical type).  More
importantly, the API between the two was impressively compatible.  This was
particularly important in 2015/2016 when there were so many different
versions (and distributions) of Hadoop and Spark that were very gradually
transitioning between 1.7.x and 1.8.x -- I know we were writing libraries
using Avro that should run correctly with either 1.7.x or 1.8.x, even with
data generated by 1.8.x.

Versioning and compatibility is still a hot topic of conversation for Avro
today... it's definitely not a solved problem.

I know that Beam lets you specify the model in AvroIO.  If Parquet is
exclusively using the singleton SpecificData, is it alright with Beam
pipelines using generic data?  If you can't succeed in getting it to work,
please do raise a JIRA because it's likely a problem in other big data
execution engines as well!




On Wed, Oct 7, 2020 at 12:28 AM Bashir Sadjad  wrote:

> Thanks a lot Ryan, this was very helpful. It resolved the immediate
> problem I had (in the minimal example I posted before). But a more complete
> context of the issue that I had is described here
> <https://lists.apache.org/thread.html/rabfc94fd696651c1f28f245fd682366ba5a4e552317ccd5ccc3f7a63%40%3Cuser.beam.apache.org%3E>
> where after serializing these GenericRecords I was writing them into
> Parquet files. With your hint, I could resolve another related issue there
> too. I just put it here for future reference:
>
> The issue was in AvroWriteSupport of Parquet, specifically here
> in writeValueWithoutConversion
> <https://github.com/apache/parquet-mr/blob/0a4e3eea991f7588c9c5e056e9d7b32a76eed5da/parquet-avro/src/main/java/org/apache/parquet/avro/AvroWriteSupport.java#L348>
>  where
> I was getting a cast exception:
>
> java.lang.ClassCastException: class java.math.BigDecimal cannot be cast to
> class java.nio.ByteBuffer (java.math.BigDecimal and java.nio.ByteBuffer are
> in module java.base of loader 'bootstrap')
>
> I debugged this a little more and the way AvroWriteSupport instance is
> created is by using SpecificData as its model (e.g., here
> <https://github.com/apache/parquet-mr/blob/0a4e3eea991f7588c9c5e056e9d7b32a76eed5da/parquet-avro/src/main/java/org/apache/parquet/avro/AvroParquetWriter.java#L163>).
> Using your trick, I could resolve that other issue with adding this to my
> code:
>
> SpecificData.get().addLogicalTypeConversion(new DecimalConversion());
>
> I haven't yet tried reading those Parquet files but I might need to do
> something similar to be able to read those BigDecimals properly.
>
> BTW, I did not fully understand the reasoning behind why support for these
> logical types are not in GenericData by default (I mean if logical types
> were not present before 1.8.x why backward compatibility for reading was an
> issue? Older schema did not have logical types, right?). I am sure there
> are good reasons that I don't understand because I don't have the full
> context but this certainly was not trivial to add DecimalConversion
> manually.
>
> Thanks again for your help.
>
> -B
>
> On Tue, Oct 6, 2020 at 10:56 AM Ryan Skraba  wrote:
>
>> Hello!  This is a frequent stumbling block for logical types.
>>
>> You should explicitly add the Decimal logical type conversion to the
>> data model that interprets the Java datum being serialized to your
>> file, like this:
>>
>> GenericData model = new GenericData();
>> model.addLogicalTypeConversion(new Conversions.DecimalConversion());
>> DatumWriter<GenericRecord> datumWriter = new
>> GenericDatumWriter<>(r.getSchema(), model);
>>
>> (You can also add it to the singleton GenericData.get() instance used
>> by your application.  I tend to prefer explicitly setting the model.)
>>
>> As an explanation: when logical types were added to Avro in 1.8.x, the
>> "standard" logical types weren't automatically added to the
>> GenericData model, likely to ensure that the reading behaviour remains
>> unchanged from 1.7.x (unless specifically requested).  Although I've
>> seldom seen user-defined logical types with Avro, they would also need
>> to be added explicitly.  This problem doesn't occur with generated
>> code and specific records, since the 

Re: Failure in writing BigDecimal as decimal logical type

2020-10-06 Thread Ryan Skraba
Hello!  This is a frequent stumbling block for logical types.

You should explicitly add the Decimal logical type conversion to the
data model that interprets the Java datum being serialized to your
file, like this:

GenericData model = new GenericData();
model.addLogicalTypeConversion(new Conversions.DecimalConversion());
DatumWriter<GenericRecord> datumWriter = new
    GenericDatumWriter<>(r.getSchema(), model);

(You can also add it to the singleton GenericData.get() instance used
by your application.  I tend to prefer explicitly setting the model.)
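
For the singleton route, that's a one-liner somewhere early in the
application:

GenericData.get().addLogicalTypeConversion(new Conversions.DecimalConversion());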

As an explanation: when logical types were added to Avro in 1.8.x, the
"standard" logical types weren't automatically added to the
GenericData model, likely to ensure that the reading behaviour remains
unchanged from 1.7.x (unless specifically requested).  Although I've
seldom seen user-defined logical types with Avro, they would also need
to be added explicitly.  This problem doesn't occur with generated
code and specific records, since the conversions are decided when the
code is generated.

I hope this is useful!  Best regards, Ryan


On Tue, Oct 6, 2020 at 7:21 AM Bashir Sadjad  wrote:
>
> Hi all,
>
> I do not have a lot of experience using Avro, so hopefully this is not an 
> obvious question (and I hope this is the right place to ask):
>
> I have a schema with a decimal logical type, e.g.,
>
> {
>   "type" : "record",
>   "name" : "testRecord",
>   "namespace" : "org.example",
>   "doc" : "",
>   "fields" : [
> {
>   "name" : "value",
>   "type":[
> "null",
> {
>   "type":"bytes",
>   "logicalType":"decimal",
>   "precision":12,
>   "scale":4
> }
>   ],
>   "doc":"",
>   "default":null
> }
>   ]
> }
>
> And I have code that parses this schema, creates a GenericRecord based 
> on it, and then puts a BigDecimal in "value" (I have copied the full code at 
> the end). The problem is that when I write this record to file, I get the 
> following exception which IIUC is coming from the fact that there are no 
> conversions registered for BigDecimal here:
>
> org.apache.avro.file.DataFileWriter$AppendWriteException: 
> org.apache.avro.AvroRuntimeException: Unknown datum type 
> java.math.BigDecimal: 10.0
>at org.apache.avro.file.DataFileWriter.append (DataFileWriter.java:317)
>at org.openmrs.analytics.TestAvro.main (TestAvro.java:25)
>at org.codehaus.mojo.exec.ExecJavaMojo$1.run (ExecJavaMojo.java:254)
>at java.lang.Thread.run (Thread.java:834)
> Caused by: org.apache.avro.AvroRuntimeException: Unknown datum type 
> java.math.BigDecimal: 10.0
>at org.apache.avro.generic.GenericData.getSchemaName (GenericData.java:912)
>at org.apache.avro.generic.GenericData.resolveUnion (GenericData.java:874)
>at org.apache.avro.generic.GenericDatumWriter.resolveUnion 
> (GenericDatumWriter.java:272)
>at org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion 
> (GenericDatumWriter.java:143)
>at org.apache.avro.generic.GenericDatumWriter.write 
> (GenericDatumWriter.java:83)
>at org.apache.avro.generic.GenericDatumWriter.writeField 
> (GenericDatumWriter.java:221)
>at org.apache.avro.generic.GenericDatumWriter.writeRecord 
> (GenericDatumWriter.java:210)
>at org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion 
> (GenericDatumWriter.java:131)
>at org.apache.avro.generic.GenericDatumWriter.write 
> (GenericDatumWriter.java:83)
>at org.apache.avro.generic.GenericDatumWriter.write 
> (GenericDatumWriter.java:73)
>at org.apache.avro.file.DataFileWriter.append (DataFileWriter.java:314)
>at org.openmrs.analytics.TestAvro.main (TestAvro.java:25)
>at org.codehaus.mojo.exec.ExecJavaMojo$1.run (ExecJavaMojo.java:254)
>at java.lang.Thread.run (Thread.java:834)
>
> But my understanding is that BigDecimal is the right type for the Avro 
> decimal logical type, is that correct? If yes, shouldn't this approach work? 
> I can do a conversion from BigDecimal to ByteBuffer but that is something 
> that I want to avoid because in my real use case, I am receiving 
> GenericRecord from another library that creates the schema too, possibly with 
> many such logical types. Here is the full code:
>
> import java.io.File;
> import java.io.IOException;
> import java.math.BigDecimal;
>
> import org.apache.avro.Schema;
> import org.apache.avro.file.DataFileWriter;
> import org.apache.avro.generic.GenericData.Record;
> import org.apache.avro.generic.GenericDatumWriter;
> import org.apache.avro.generic.GenericRecord;
> import org.apache.avro.io.DatumWriter;
>
> public class TestAvro {
>   public static void main(String[] args) throws IOException {
> Schema testSchema = new Schema.Parser().parse(new 
> File("tmp/test_decimal_union.avsc"));
> GenericRecord testRecord = new Record(testSchema);
> testRecord.put("value", BigDecimal.valueOf(10.0));
> DatumWriter datumWriter = new 
> GenericDatumWriter<>(testSchema);
> 

Re: working with Avro records and schemas, programmatically

2020-09-18 Thread Ryan Skraba
Hello Colin, you've hit one bit of fussiness with the Java SDK... you
can't reuse a Schema.Field object in two Records, because a field
knows its own position in the record[1].  If a field were to belong to
two records at different positions, this method would have an
ambiguous response.

As a workaround, since Avro 1.9, there's a copy constructor that you
can use to clone the field:

List<Schema.Field> clonedFields = existingFields.stream()
    .map(f -> new Schema.Field(f, f.schema()))
    .collect(Collectors.toList());
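
Applied to the createRecord(...) call from your quoted code below, the cloned fields can then be used directly (a sketch; the name, doc and namespace are taken from your snippet):

Schema updatedSchema = Schema.createRecord(
    "UpdatedName", "", "avro.com.example.namespace", false, clonedFields);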

That being said, I don't see any reason we MUST throw an exception.
There's a couple of alternative strategies we could use in the Java
SDK:

1. If the position is the same in both records, allow the field to be
reused (which enables cloning use cases).

2. Make a copy of the field to reuse internally if the position is
already set (probably OK, since it's supposed to be immutable).

3. Allow the field to be reused, and only throw the exception if
someone calls the position() method later.

Do any of those sound like a useful change for your use case?  Don't
hesitate to create a JIRA or a contribution if you like!

All my best, Ryan

On Fri, Sep 18, 2020 at 8:27 AM Colin Williams
 wrote:
>
> Hello,
>
> I'm trying to understand working with Avro records and schemas,
> programmatically. Then I was first trying to create a new schema and
> records based on existing records, but with a different name /
> namespace. It seems then I don't understand getFields() or
> createRecord(...). Why can't I use the fields obtained from
> getFields() in createRecord()?  How would I go about this properly?
>
> // for an existing record already present
> GenericRecord someRecord
>
> // get a list of existing fields
> List existingFields = someRecord.getSchema().getFields();
>
> // schema for new record with existing fields
> Schema updatedSchema = createRecord("UpdatedName",
> "","avro.com.example.namespace" , false, existingFields);
>
> ^^ throws an exception ^^
>
> /* Caused by: org.apache.avro.AvroRuntimeException: Field already
> used: eventMetadata type:UNION pos:0
> at org.apache.avro.Schema$RecordSchema.setFields(Schema.java:888)
> at org.apache.avro.Schema$RecordSchema.(Schema.java:856)
> at org.apache.avro.Schema.createRecord(Schema.java:217)
> */
>
> final int length = fields.size();
>
> GenericRecord clonedRecord = new GenericData.Record(updatedSchema);
> for (int i = 0; i < length; i++) {
> final Schema.Field field = existingFields.get(i);
> clonedRecord.put(i, someRecord.get(i));
> }
>
>
> Best Regards,
>
> Colin Williams


Re: AvroTypeException: Attempt to process a double when a string was expected

2020-08-03 Thread Ryan Skraba
Hello!  Thanks for the MCVE -- I could reproduce your symptoms easily!

Even when you're using JSON encoding, you should use a
GenericDatumReader<> to read generic datum.

The Json.ObjectReader sounds correct but is actually for a different
JSON use case (storing any arbitrary JSON snippet in a known, valid
Avro schema).  The error message is not very helpful, unfortunately,
but points to the differences between the first UNION in your record
and the Json.avsc that Json.ObjectReader implicitly assumes.
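
For reference, a minimal sketch of reading JSON-encoded generic data (schemaString and jsonString here stand in for your schema and payload):

Schema schema = new Schema.Parser().parse(schemaString);
Decoder decoder = DecoderFactory.get().jsonDecoder(schema, jsonString);
GenericDatumReader<GenericRecord> reader = new GenericDatumReader<>(schema);
GenericRecord record = reader.read(null, decoder);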

I hope this helps!  Ryan


On Thu, Jul 30, 2020 at 7:30 PM Stefan Radu Popescu
 wrote:
>
> Hello,
>
> I am having an issue with reading a GenericDatum from a json String written 
> with avro GenericDatumWriter.
>
> Exception as in title.
> MCVE and details at:
> https://gist.github.com/PopescuStefanRadu/cea799614ba49b753238b77ac4b242e5
>
> Avro versions tested:
>
> 1.9.2, 1.10.0
>
> Java runtime:
>
> openjdk 14.0.1 2020-04-14
> OpenJDK Runtime Environment (build 14.0.1+7-Ubuntu-1ubuntu1)
> OpenJDK 64-Bit Server VM (build 14.0.1+7-Ubuntu-1ubuntu1, mixed mode, sharing)
>
> Compiled towards: java 11.
>
> I'd be very grateful if you could help me out.
>
> Thank you,
> Stefan Popescu


Re: Counting bytes read

2020-07-29 Thread Ryan Skraba
Hi,

You've got it right: the DataFileReader and DataFileStream read a
block at a time, and "fileReader.tell()" sits at the sync marker
between blocks while records are being read from the current block.
You're probably aware that DataFileReader is only seekable to block
boundaries.

The entire block is read from disk and used as the source of the next
N records, so it literally *is* the number of bytes that were read at
the time the current record was emitted, and it takes into account
the file compression, if any (not the strict size of the binary-encoded
record).
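
If block granularity is good enough for your metric, a minimal sketch along these lines works with the existing API (reusing the file name from your example):

File file = new File("mydata.avro");
DatumReader<GenericRecord> reader = new GenericDatumReader<>();
try (DataFileReader<GenericRecord> fileReader = new DataFileReader<>(file, reader)) {
  GenericRecord record = null;
  while (fileReader.hasNext()) {
    record = fileReader.next(record);
    // tell() only advances when a new block is loaded, so the value is block-granular.
    System.out.println("Bytes read so far (block granularity): " + fileReader.tell());
  }
}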

The number of bytes accumulated while decoding each record doesn't look
like it's exposed, but it might be accessible through the binary
decoder used in the DatumReader.  If that doesn't work, maybe file a
JIRA feature request to expose this information -- I can see it being
useful for metrics like yours.

I hope this helps, let us know if you find a solution!  Ryan



On Mon, Jul 27, 2020 at 7:46 PM Jeremy Custenborder
 wrote:
>
> Not sure off hand. I thought you were just reading sequentially.
>
> On Sun, Jul 26, 2020 at 12:15 AM Julien Phalip  wrote:
> >
> > Hi Jeremy,
> >
> > Thanks for your reply. I'm currently using DataFileReader because I also 
> > need to use random access/seeks. Would that be possible with DataFileStream 
> > as well? Or is there another technique that could work?
> >
> > Julien
> >
> > On Sat, Jul 25, 2020 at 9:36 PM Jeremy Custenborder 
> >  wrote:
> >>
> >> Could you use DataFileStream and pass in your own stream? Then you
> >> could get bytes read.
> >>
> >> [1] 
> >> https://avro.apache.org/docs/1.9.2/api/java/org/apache/avro/file/DataFileStream.html
> >>
> >> On Sat, Jul 25, 2020 at 7:42 PM Julien Phalip  wrote:
> >> >
> >> > Hi,
> >> >
> >> > I'd like to keep track of the number of bytes read as I'm reading 
> >> > through the records of an Avro file.
> >> >
> >> > See this sample code:
> >> >
> >> > File file = new File("mydata.avro");
> >> > DatumReader reader = new GenericDatumReader<>();
> >> > DataFileReader fileReader = new DataFileReader<>(file, 
> >> > reader);
> >> > GenericRecord record = new GenericData.Record(fileReader.getSchema());
> >> > long counter = 0;
> >> > while (fileReader.hasNext()) {
> >> > fileReader.next(record);
> >> > counter += // Magic happens here
> >> > System.out.println("Bytes read so far: " + counter);
> >> > }
> >> >
> >> > I can't seem to find a way to extract that information from the 
> >> > `fileReader` or  `record` objects. I figured maybe `fileReader.tell()` 
> >> > might help here, but that value seems to stay stuck on the current 
> >> > block's position.
> >> >
> >> > Is this possible?
> >> >
> >> > Thanks!
> >> >
> >> > Julien


Re: Logo

2020-05-12 Thread Ryan Skraba
Hello!  There's a policy for trademarks, service marks, and graphic
logos at https://www.apache.org/foundation/marks/

It sounds like you're using the Apache Avro logo to refer to the
Apache Avro project (see "nominative use" at the first link above),
which is usually OK. There are some additional guidelines at
https://www.apache.org/foundation/marks/guide#logos

All my best, Ryan

On Mon, May 11, 2020 at 4:42 PM Miguel Silvestre  wrote:
>
> Hi,
>
> Can I use the logo internally at my company?
> I intend to use the logo in a git project as an icon.
>
> Thank you
> --
> Miguel Silvestre


Re: Decimal type, limitation on scale

2020-03-03 Thread Ryan Skraba
It looks like the "scale must be less than precision" rule comes from
Hive requirements[1] (although while searching, this is called into
question elsewhere in Hive[2]). From the design document, the
requirement was specifically to avoid variable (per-row scale):

> For instance, applications (particularly native applications) such as SAS 
> which need to
> pre-allocate memory require fixed types to do so efficiently.

I believe that if we were to write a file (for example) with a
negative scale using Avro 1.10, a reader with an older version
_should_ just fall back to bytes, which seems fair enough.  I would
consider it a bug if the reader just failed on an "out-of-bounds"
scale!

Any thoughts on what Hive (as an example) would require if we were to
relax this constraint in the spec?

Ryan

[1]: https://issues.apache.org/jira/browse/HIVE-3976
[2]: 
https://github.com/apache/hive/blob/94dca16e4eb3caf7dcaa43ae92807e5750e1ff04/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDFRound.java#L54

On Mon, Mar 2, 2020 at 9:53 PM Zoltan Farkas
 wrote:
>
> +dev adding the dev mailing list, maybe somebody there can answer the 
> reasoning.
>
> when comparing sql server with Oracle and Postgress:
>
> https://docs.microsoft.com/en-us/sql/t-sql/data-types/decimal-and-numeric-transact-sql?view=sql-server-ver15
>  
> 
>
> https://docs.oracle.com/cd/A84870_01/doc/server.816/a76965/c10datyp.htm#743 
> 
> https://www.postgresql.org/docs/9.1/datatype-numeric.html 
> 
>
>
> One allows for negative scale, the other doesn’t.
> My biggest issue with the current decimal spec is that it does not encode the 
> scale (it uses the scale defined in the schema); as such it cannot accommodate an 
> Oracle or Postgres NUMBER without scale coercion.
>
> there are other differences (like NAN, …)
>
> But there is no reason why the decimal2 logical type should not be created to 
> address the above…
>
> or even better, promote decimal to a first-class type:
> https://issues.apache.org/jira/browse/AVRO-2164
>
>
> —Z
>
> > On Mar 2, 2020, at 2:34 PM, Christopher Egerton  wrote:
> >
> > Hi all,
> >
> > I've been trying to do some research on the logical decimal type and why 
> > the scale of a decimal type must be between zero and the precision of the 
> > type, inclusive. The ticket https://issues.apache.org/jira/browse/AVRO-1402 
> >  has a lot of discussion 
> > around the design of the type, but I haven't been able to find any 
> > rationale for the limitations on the scale of the type.
> >
> > These don't appear to align with existing conventions for precision and 
> > scale in the context of SQL numeric types, the JDBC API, and the Java 
> > standard library's BigDecimal class. In these contexts, the precision must 
> > be a positive number, but the scale can be any value--positive 
> > (representing the number of digits of precision that are available after 
> > the decimal point), negative (representing the number of trailing zeroes at 
> > the end of the number before an implicit decimal point), or zero. It is not 
> > bounded by the precision of the type.
> >
> > The definitions for scale and precision appear to align across these 
> > contexts, including the Avro spec, so I'm curious as to why the Avro 
> > spec--seemingly an anomaly--is the only one to declare these limitations on 
> > what the scale of a decimal type can be.
> >
> > Does anyone know why these exist, and if not, would it be okay to file a 
> > ticket to remove them from the spec and begin work on it?
> >
> > Cheers,
> >
> > Chris
>


[ANNOUNCE] Apache Avro 1.9.2 released

2020-02-13 Thread Ryan Skraba
The Apache Avro community is pleased to announce the release of Avro 1.9.2!

The link to all fixed JIRA issues and a brief summary can be found at:
https://github.com/apache/avro/releases/tag/release-1.9.2

This release includes 73 Jira issues:
https://jira.apache.org/jira/issues/?jql=project%20%3D%20AVRO%20AND%20fixVersion%20%3D%201.9.2

Some bug fixes:
* C#: AVRO-2606 handle multidimensional arrays of custom types
* Java: AVRO-2592 Avro decimal fails on some conditions
* Java: AVRO-2641 Generated code results in java.lang.ClassCastException
* Java: AVRO-2663 Projection on nested records does not work
* Python: AVRO-2429 unknown logical types should fall back
Improvements:
* Java: AVRO-2247 Improve Java reading performance with a new reader
* Python: AVRO-2104 Schema normalisation and fingerprint support for Python 3
* Work to unify the Python 2 and Python 3 APIs in preparation for sunset
* Improved tests
* Improved, more reliable builds
* Improved readability
* Upgraded dependencies to latest versions, including CVE fixes
And more...

This release can be downloaded from: https://www.apache.org/dyn/closer.cgi/avro/

The released artifacts are available:
* C#: https://www.nuget.org/packages/Apache.Avro/1.9.2
* Java: from Maven Central,
* Javascript: https://www.npmjs.com/package/avro-js/v/1.9.2
* Python 2: https://pypi.org/project/avro/1.9.2/
* Python 3: https://pypi.org/project/avro-python3/1.9.2.1/
  - See https://issues.apache.org/jira/browse/AVRO-2737
* Ruby: https://rubygems.org/gems/avro/versions/1.9.2

Thanks to everyone for contributing!

Ryan Skraba


Re: How to serialize & deserialize contiguous block of GenericRecords

2020-01-30 Thread Ryan Skraba
Ah!  OK, I think I understand better.

Your serialize method looks almost OK -- as I mentioned, you can use an
OutputStream wrapper to write directly to a ByteBuffer.  This wrapper
doesn't exist in the Java utilities AFAIK, but there are examples on the
web (
https://github.com/EsotericSoftware/kryo/blob/master/src/com/esotericsoftware/kryo/io/ByteBufferOutputStream.java).
The one I mentioned in the previous message wraps a list of ByteBuffers.

In any case, don't forget to *encoder.flush()* before closing the
outputStream in your serialize!
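
Concretely, the tail of your serialize method would look something like this (a sketch of just that fix, mirroring your quoted code below):

for (final Event event : events) {
  final GenericData.Record record = new GenericData.Record(schema);
  // populate record object
  datumWriter.write(record, encoder);
}
encoder.flush();  // without this, the last buffered bytes never reach outputStream
outputStream.close();
buffer.put(outputStream.toByteArray());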

Your deserialize is a bit problematic, because the *entire* byte buffer
capacity will be passed if you use buffer.array(), not just the bytes that
were used.

Fortunately, you can use the ByteBufferInputStream already present in Avro
to handle this.  The code would look something like:

ByteBufferInputStream bbais =
    new ByteBufferInputStream(Collections.singletonList(buffer));
final BinaryDecoder decoder = DecoderFactory.get().binaryDecoder(bbais, null);
final GenericDatumReader<GenericRecord> datumReader =
    new GenericDatumReader<>(schema);

List<Event> out = new ArrayList<>();
while (!decoder.isEnd()) {
  GenericRecord record = datumReader.read(null, decoder);
  // Your transformation of record to event, and add to the list here...
}
return out;

It's critical that buffer has the position and limit set correctly to the
start and end of the binary data before entering this method, of course!
The position and limit will not be correct coming out of the serialize
method, although probably a buffer.flip() will do what you want.

I hope this is useful, all my best, Ryan


On Wed, Jan 29, 2020 at 7:35 PM Pedro Cardoso 
wrote:

> Hi Ryan,
>
> Thank you so much for your reply! You were right about the encoder in the
> serializer method, that was my mistake. I submitted a png rather than just
> text because I thought the highlighting would help.
> I may not have been very clear about my question, I understand that via
> the DatumWriter/DatumReader I can serialize and deserialize a given Avro
> GenericRecord respectively.
>
> My question is, consider several GenericRecords all concatenated into a
> single byte array as follows:
>
> *[serializedGenericRecord1, serializedGenericRecord2,
> serializedGenericRecord3, etc...]*
>
> How can I deserialize them using the DatumReader API? If it's possible
> out-of-the-box can you point me in the right direction?
> Does this make sense?
>
> See the code below (in text this time :) ) if it helps:
>
> public void serialize(final List events, final UUID schemaId, final 
> ByteBuffer buffer) throws IOException {
> final Schema schema = getAvroSchema(schemaId);
> final ByteArrayOutputStream outputStream = new ByteArrayOutputStream();
> final Encoder encoder = EncoderFactory.get().binaryEncoder(outputStream, 
> null);
> final GenericDatumWriter datumWriter = new 
> GenericDatumWriter<>(schema);
>
> for (final Event event : events) {
> final GenericData.Record record = new GenericData.Record(schema);
> //populate record object
> datumWriter.write(record, encoder);
> }
>
> outputStream.close();
> buffer.put(outputStream.toByteArray());
> }
>
> public List deserialize(final ByteBuffer buffer, final UUID schemaId) 
> throws IOException {
> final List events = new ArrayList<>();
> final Schema schema = getAvroSchema(schemaId);
> final BinaryDecoder decoder = 
> DecoderFactory.get().binaryDecoder(buffer.array(), null);
> final GenericDatumReader datumReader = new 
> GenericDatumReader<>(schema);
> GenericRecord record = new GenericData.Record(schema);
>
> // How do I loop?
> record  = datumReader.read(record, decoder);
> // populate Event object and add to list
>
> return events;
> }
>
>
> Thank you once again for your help!
>
> Cheers
> Pedro Cardoso
>
> Research Data Engineer
>
> pedro.card...@feedzai.com
>
>
>
>
>
>
> On Wed, Jan 29, 2020 at 5:34 PM Ryan Skraba  wrote:
>
>> Hello!
>>
>> It's a bit difficult to discover what's going wrong -- I'm not sure that
>> the code in the image corresponds to the exception you are encountering!
>> Notably, there's no reference to DataFileStream...  Typically, it would be
>> easier with code as TXT than as PNG!
>>
>> It is definitely possible to serialize Avro GenericRecords into bytes!

Re: How to serialize & deserialize contiguous block of GenericRecords

2020-01-29 Thread Ryan Skraba
Hello!

It's a bit difficult to discover what's going wrong -- I'm not sure that
the code in the image corresponds to the exception you are encountering!
Notably, there's no reference to DataFileStream...  Typically, it would be
easier with code as TXT than as PNG!

It is definitely possible to serialize Avro GenericRecords into bytes!  The
example code looks like it's using the DataFileWriter (and ignoring the
Encoder).  Keep in mind that this creates an Avro file (also known as an
Avro Object Container file or .avro file).  This is more than just "pure"
serialized bytes -- it contains some header information and sync markers,
which makes it easier to split and process a single file on multiple nodes
in big data.

If you were to use a DatumWriter and an encoder, you could obtain just the
"pure" binary data without any framing bytes.  If that is your goal, I
suggest looking into the DatumWriter / DatumReader classes (as opposed to
the DataFileXxx classes).
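
As a rough sketch of that route (no container framing, so the reader must already know the same -- or a compatible -- schema):

ByteArrayOutputStream out = new ByteArrayOutputStream();
BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
DatumWriter<GenericRecord> writer = new GenericDatumWriter<>(schema);
writer.write(record, encoder);
encoder.flush();
byte[] pureBytes = out.toByteArray();

BinaryDecoder decoder = DecoderFactory.get().binaryDecoder(pureBytes, null);
GenericRecord roundTripped = new GenericDatumReader<GenericRecord>(schema).read(null, decoder);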

From the given exception "Invalid sync", it looks like you might be writing
pure Avro bytes and attempting to read them back as the file format.

Since the DatumWriter API uses OutputStream (instead of ByteBuffer),
there's a utility class called ByteBufferOutputStream that you might find
interesting.  It permits writing to a series of 8K java.nio.ByteBuffer
instances, which might be OK for your use case.  There are other
implementations of ByteBuffer-backed OutputStreams available that might be
better suited.

I hope this is useful, Ryan


On Wed, Jan 29, 2020 at 4:22 PM Pedro Cardoso 
wrote:

> Hello,
>
> I am trying to write a sequence of Avro GenericRecords into a Java
> ByteBuffer and later on deserialize them. I have tried using
> FileWriter/Readers and copying the content of the underlying buffer to my
> target object. The alternative is to try to split a ByteBuffer by the
> serialized GenericRecords individually and use a BinaryDecoder to read each
> property of a record individually.
>
> Please see attached such an example of the former code.
> The presented code fails with
>
> org.apache.avro.AvroRuntimeException: java.io.IOException: Invalid sync!
> at org.apache.avro.file.DataFileStream.hasNext(DataFileStream.java:223)
> at com.feedzai.research.experiments.bookkeeper.Avro.main(Avro.java:97)
> Caused by: java.io.IOException: Invalid sync!
> at
> org.apache.avro.file.DataFileStream.nextRawBlock(DataFileStream.java:318)
> at org.apache.avro.file.DataFileStream.hasNext(DataFileStream.java:212)
> ... 1 more
>
> Hence my questions are:
>  - Is it at all possible to serialize/deserialize lists of Avro records to
> a ByteBuffer and back?
>  - If so, can anyone point me in the right direction?
>  - If not, can anyone point me to code examples of alternative solutions?
>
> Thank you and have a good day.
>
> Pedro Cardoso
>
> Research Data Engineer
>
> pedro.card...@feedzai.com
>
>
>
>
>
> *The content of this email is confidential and intended for the recipient
> specified in message only. It is strictly prohibited to share any part of
> this message with any third party, without a written consent of the sender.
> If you received this message by mistake, please reply to this message and
> follow with its deletion, so that we can ensure such a mistake does not
> occur in the future.*


Re: avro-tools illegal reflective access warnings

2020-01-17 Thread Ryan Skraba
Hello!  I just created a JIRA for this as an improvement :D
https://issues.apache.org/jira/browse/AVRO-2689

To check evolution, we'd probably want to specify the reader schema in
the GenericDatumReader created here:
https://github.com/apache/avro/blob/master/lang/java/tools/src/main/java/org/apache/avro/tool/DataFileReadTool.java#L75

The writer schema is automatically set when the DataFileStream is
created.  If we want to set a different reader schema (than the one
found in the file), it should be set by calling
reader.setExpected(readerSchema) just after the DataFileStream is
created.
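
A hedged sketch of what that could look like inside the tool (readerSchema here is hypothetical, e.g. parsed from the proposed command-line flag):

GenericDatumReader<Object> reader = new GenericDatumReader<>();
DataFileStream<Object> streamReader = new DataFileStream<>(inStream, reader);
if (readerSchema != null) {
  reader.setExpected(readerSchema);  // the writer schema was already taken from the file header
}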

I think it's a pretty good idea -- it feels like we're seeing more
questions about schema evolution these days, so that would be a neat
way for a user to test (or to create reproducible scenarios for bug
reports).  If you're interested, feel free to take the JIRA!  I'd be
happy to help out.

Ryan


On Fri, Jan 17, 2020 at 2:22 PM roger peppe  wrote:
>
> On Thu, 16 Jan 2020 at 17:21, Ryan Skraba  wrote:
>>
>> didn't find anything currently in the avro-tools that uses both
>> reader and writer schemas while deserializing data...  It should be a
>> pretty easy feature to add as an option to the DataFileReadTool
>> (a.k.a. tojson)!
>
>
> Thanks for that suggestion. I've been delving into that code a bit and trying 
> to understand what's going on.
>
> At the heart of it is this code:
>
> GenericDatumReader reader = new GenericDatumReader<>();
> try (DataFileStream streamReader = new DataFileStream<>(inStream, 
> reader)) {
>   Schema schema = streamReader.getSchema();
>   DatumWriter writer = new GenericDatumWriter<>(schema);
>   JsonEncoder encoder = EncoderFactory.get().jsonEncoder(schema, out, 
> pretty);
>
> I'm trying to work out where the best place to put the specific reader schema 
> (taken from a command line flag) might be.
>
> Would it be best to do it when creating the DatumReader (it looks like there 
> might be a way to create that with a generic writer schema and a specific 
> reader schema, although I can't quite see how to do that atm), or when 
> creating the DatumWriter?
> Or perhaps there's a better way?
>
> Thanks for any guidance.
>
>cheers,
> rog.
>>
>>
>> You are correct about running ./build.sh dist in the java directory --
>> it fails with JDK 11 (likely fixable:
>> https://issues.apache.org/jira/browse/MJAVADOC-562).
>>
>> You should probably do a simple mvn clean install instead and find the
>> jar in lang/java/tools/target/avro-tools-1.10.0-SNAPSHOT.jar.  That
>> should work with JDK11 without any problem (well-tested in the build).
>>
>> Best regards, Ryan
>>
>>
>>
>> On Thu, Jan 16, 2020 at 5:49 PM roger peppe  wrote:
>> >
>> > Update: I tried running `build.sh dist` in `lang/java` and it failed (at 
>> > least, it looks like a failure message) after downloading a load of Maven 
>> > deps with the following errors: 
>> > https://gist.github.com/rogpeppe/df05d993254dc5082253a5ef5027e965
>> >
>> > Any hints on what I should do to build the avro-tools jar?
>> >
>> >   cheers,
>> > rog.
>> >
>> > On Thu, 16 Jan 2020 at 16:45, roger peppe  wrote:
>> >>
>> >>
>> >> On Thu, 16 Jan 2020 at 13:57, Ryan Skraba  wrote:
>> >>>
>> >>> Hello!  Is it because you are using brew to install avro-tools?  I'm
>> >>> not entirely familiar with how it packages the command, but using a
>> >>> direct bash-like solution instead might solve this problem of mixing
>> >>> stdout and stderr.  This could be the simplest (and right) solution
>> >>> for piping.
>> >>
>> >>
>> >> No, I downloaded the jar and am directly running it with "java -jar 
>> >> ~/other/avro-tools-1.9.1.jar".
>> >> I'm using Ubuntu Linux 18.04 FWIW - the binary comes from Debian package 
>> >> openjdk-11-jre-headless.
>> >>
>> >> I'm going to try compiling avro-tools myself to investigate but I'm a 
>> >> total Java ignoramus - wish me luck!
>> >>
>> >>>
>> >>> alias avrotoolx='java -jar
>> >>> ~/.m2/repository/org/apache/avro/avro-tools/1.9.1/avro-tools-1.9.1.jar'
>> >>> avrotoolx tojson x.out 2> /dev/null
>> >>>
>> >>> (As Fokko mentioned, the 2> /dev/null isn't even necessary -- the
>> >>> warnings and logs should not be piped along with the normal content.)
>> >>>
>> >>> Otherwise, IIRC

Re: avro-tools illegal reflective access warnings

2020-01-16 Thread Ryan Skraba
Hello!  For a simple, silent log4j, I use:

$ cat /tmp/log4j.properties
log4j.rootLogger=off

I didn't find anything currently in the avro-tools that uses both
reader and writer schemas while deserializing data...  It should be a
pretty easy feature to add as an option to the DataFileReadTool
(a.k.a. tojson)!

You are correct about running ./build.sh dist in the java directory --
it fails with JDK 11 (likely fixable:
https://issues.apache.org/jira/browse/MJAVADOC-562).

You should probably do a simple mvn clean install instead and find the
jar in lang/java/tools/target/avro-tools-1.10.0-SNAPSHOT.jar.  That
should work with JDK11 without any problem (well-tested in the build).

Best regards, Ryan



On Thu, Jan 16, 2020 at 5:49 PM roger peppe  wrote:
>
> Update: I tried running `build.sh dist` in `lang/java` and it failed (at 
> least, it looks like a failure message) after downloading a load of Maven 
> deps with the following errors: 
> https://gist.github.com/rogpeppe/df05d993254dc5082253a5ef5027e965
>
> Any hints on what I should do to build the avro-tools jar?
>
>   cheers,
> rog.
>
> On Thu, 16 Jan 2020 at 16:45, roger peppe  wrote:
>>
>>
>> On Thu, 16 Jan 2020 at 13:57, Ryan Skraba  wrote:
>>>
>>> Hello!  Is it because you are using brew to install avro-tools?  I'm
>>> not entirely familiar with how it packages the command, but using a
>>> direct bash-like solution instead might solve this problem of mixing
>>> stdout and stderr.  This could be the simplest (and right) solution
>>> for piping.
>>
>>
>> No, I downloaded the jar and am directly running it with "java -jar 
>> ~/other/avro-tools-1.9.1.jar".
>> I'm using Ubuntu Linux 18.04 FWIW - the binary comes from Debian package 
>> openjdk-11-jre-headless.
>>
>> I'm going to try compiling avro-tools myself to investigate but I'm a total 
>> Java ignoramus - wish me luck!
>>
>>>
>>> alias avrotoolx='java -jar
>>> ~/.m2/repository/org/apache/avro/avro-tools/1.9.1/avro-tools-1.9.1.jar'
>>> avrotoolx tojson x.out 2> /dev/null
>>>
>>> (As Fokko mentioned, the 2> /dev/null isn't even necessary -- the
>>> warnings and logs should not be piped along with the normal content.)
>>>
>>> Otherwise, IIRC, there is no way to disable the first illegal
>>> reflective access warning when running in Java 9+, but you can "fix"
>>> these module errors, and deactivate the NativeCodeLoader logs with an
>>> explicit log4j.properties:
>>>
>>> java -Dlog4j.configuration=file:///tmp/log4j.properties --add-opens
>>> java.security.jgss/sun.security.krb5=ALL-UNNAMED -jar
>>> ~/.m2/repository/org/apache/avro/avro-tools/1.9.1/avro-tools-1.9.1.jar
>>> tojson x.out
>>
>>
>> Thanks for that suggestion! I'm afraid I'm not familiar with log4j 
>> properties files though. What do I need to put in /tmp/log4j.properties to 
>> make this work?
>>
>>> None of that is particularly satisfactory, but it could be a
>>> workaround for your immediate use.
>>
>>
>> Yeah, not ideal, because if something goes wrong, stdout will be corrupted, 
>> but at least some noise should go away :)
>>
>>> I'd also like to see a more unified experience with the CLI tool for
>>> documentation and usage.  The current state requires a bit of Avro
>>> expertise to use, but it has some functions that would be pretty
>>> useful for a user working with Avro data.  I raised
>>> https://issues.apache.org/jira/browse/AVRO-2688 as an improvement.
>>>
>>> In my opinion, a schema compatibility tool would be a useful and
>>> welcome feature!
>>
>>
>> That would indeed be nice, but in the meantime, is there really nothing in 
>> the avro-tools commands that uses a chosen schema to read a data file 
>> written with some other schema? That would give me what I'm after currently.
>>
>> Thanks again for the helpful response.
>>
>>cheers,
>>  rog.
>>
>>>
>>> Best regards, Ryan
>>>
>>>
>>>
>>> On Thu, Jan 16, 2020 at 12:25 PM roger peppe  wrote:
>>> >
>>> > Hi Fokko,
>>> >
>>> > Thanks for your swift response!
>>> >
>>> > Stdout and stderr definitely seem to be merged on this platform at least. 
>>> > Here's a sample:
>>> >
>>> > % avrotool random --count 1 --schema '"int"'  x.out
>>> > % avrotool tojson x.out > x.json
>>> > % cat x.json
>>> > 125140891

Re: avro-tools illegal reflective access warnings

2020-01-16 Thread Ryan Skraba
Hello!  Is it because you are using brew to install avro-tools?  I'm
not entirely familiar with how it packages the command, but using a
direct bash-like solution instead might solve this problem of mixing
stdout and stderr.  This could be the simplest (and right) solution
for piping.

alias avrotoolx='java -jar
~/.m2/repository/org/apache/avro/avro-tools/1.9.1/avro-tools-1.9.1.jar'
avrotoolx tojson x.out 2> /dev/null

(As Fokko mentioned, the 2> /dev/null isn't even necessary -- the
warnings and logs should not be piped along with the normal content.)

Otherwise, IIRC, there is no way to disable the first illegal
reflective access warning when running in Java 9+, but you can "fix"
these module errors, and deactivate the NativeCodeLoader logs with an
explicit log4j.properties:

java -Dlog4j.configuration=file:///tmp/log4j.properties --add-opens
java.security.jgss/sun.security.krb5=ALL-UNNAMED -jar
~/.m2/repository/org/apache/avro/avro-tools/1.9.1/avro-tools-1.9.1.jar
tojson x.out

None of that is particularly satisfactory, but it could be a
workaround for your immediate use.

I'd also like to see a more unified experience with the CLI tool for
documentation and usage.  The current state requires a bit of Avro
expertise to use, but it has some functions that would be pretty
useful for a user working with Avro data.  I raised
https://issues.apache.org/jira/browse/AVRO-2688 as an improvement.

In my opinion, a schema compatibility tool would be a useful and
welcome feature!

Best regards, Ryan



On Thu, Jan 16, 2020 at 12:25 PM roger peppe  wrote:
>
> Hi Fokko,
>
> Thanks for your swift response!
>
> Stdout and stderr definitely seem to be merged on this platform at least. 
> Here's a sample:
>
> % avrotool random --count 1 --schema '"int"'  x.out
> % avrotool tojson x.out > x.json
> % cat x.json
> 125140891
> WARNING: An illegal reflective access operation has occurred
> WARNING: Illegal reflective access by 
> org.apache.hadoop.security.authentication.util.KerberosUtil 
> (file:/home/rog/other/avro-tools-1.9.1.jar) to method 
> sun.security.krb5.Config.getInstance()
> WARNING: Please consider reporting this to the maintainers of 
> org.apache.hadoop.security.authentication.util.KerberosUtil
> WARNING: Use --illegal-access=warn to enable warnings of further illegal 
> reflective access operations
> WARNING: All illegal access operations will be denied in a future release
> 20/01/16 11:00:37 WARN util.NativeCodeLoader: Unable to load native-hadoop 
> library for your platform... using builtin-java classes where applicable
> %
>
> I've just verified that it's not a problem with the java executable itself (I 
> ran a program that printed to System.err and the text correctly goes to the 
> standard error).
>
> > Regarding the documentation, the CLI itself contains info on all the 
> > available commands. Also, there are excellent online resources: 
> > https://www.michael-noll.com/blog/2013/03/17/reading-and-writing-avro-files-from-the-command-line/
> >  Is there anything specific that you're missing?
>
> There's the single line summary produced for each command by running 
> "avro-tools" with no arguments, but that's not as much info as I'd ideally 
> like. For example, it often doesn't say what file format is being written or 
> read. For some commands, the purpose is not very clear.
>
> For example the description of the recodec command is "Alters the codec of a 
> data file". It doesn't describe how it alters it or how one might configure 
> the alteration parameters. I managed to get some usage help by passing it 
> more than two parameters (specifying "--help" gives an exception), but that 
> doesn't provide much more info:
>
> % avro-tools recodec a b c
> Expected at most an input file and output file.
> Option      Description
> ------      -----------
> --codec     Compression codec (default: null)
> --level     Compression level (only applies to deflate and xz) (default: -1)
>
> For the record, I'm wondering whether it might be possible to get avrotool to tell 
> me if one schema is compatible with another, so that I can check hypotheses about 
> schema checking in practice without having to write Java code.
>
>   cheers,
> rog.
>
>
> On Thu, 16 Jan 2020 at 10:30, Driesprong, Fokko  wrote:
>>
>> Hi Rog,
>>
>> This is actually a warning produced by the Hadoop library, that we're using. 
>> Please note that htis isn't part of the stdout:
>>
>> $ find /tmp/tmp
>> /tmp/tmp
>> /tmp/tmp/._SUCCESS.crc
>> /tmp/tmp/part-0-9300fba6-ccdd-4ecc-97cb-0c3ae3631be5-c000.avro
>> /tmp/tmp/.part-0-9300fba6-ccdd-4ecc-97cb-0c3ae3631be5-c000.avro.crc
>> /tmp/tmp/_SUCCESS
>>
>> $ avro-tools tojson 
>> /tmp/tmp/part-0-9300fba6-ccdd-4ecc-97cb-0c3ae3631be5-c000.avro
>> 20/01/16 11:26:10 WARN util.NativeCodeLoader: Unable to load native-hadoop 
>> library for your platform... using builtin-java classes where applicable
>> {"line_of_text":{"string":"Hello"}}
>> 

Re: name-agnostic schema resolution (a.k.a. structural subtyping?)

2019-12-19 Thread Ryan Skraba
Hello!  You might be interested in this short discussion on the dev@
mailing list: 
https://lists.apache.org/x/thread.html/dd7a23c303ef045c124050d7eac13356b20551a6a663a79cb8807f41@%3Cdev.avro.apache.org%3E

In short, it appears that the record name is already ignored in
record-to-record matching (at least outside of unions) as an
implementation detail in Java.  I never *did* get around to verifying
the behaviour of the other language implementations, but if this is
what is being done in practice, it's worth clarifying in the
specification.

It does seem like a very pragmatic thing to do, and it would help with
the CloudEvents Avro use case.  It would be a nice recipe to share in
the docs: the right way to read an envelope from a custom message when
you don't need the payload.
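
For example, reading just the envelope part of a larger record could look roughly like this (a sketch that relies on the name-ignoring behaviour above; fullWriterSchema, envelopeReaderSchema and payload are stand-ins for the event schema, the CloudEvent-style envelope schema and the binary data):

GenericDatumReader<GenericRecord> reader =
    new GenericDatumReader<>(fullWriterSchema, envelopeReaderSchema);
GenericRecord envelope =
    reader.read(null, DecoderFactory.get().binaryDecoder(payload, null));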

I'm not sure I understand the third strategy, however!  There aren't
any names in binary data when writing - what would the alias do?

(Also, I largely prefer your avro version with explicitly typed
metadata fields and names as well!)

All my best, Ryan

On Wed, Dec 18, 2019 at 5:49 PM roger peppe  wrote:
>
> Hi,
>
> Background: I've been contemplating the proposed Avro format in the 
> CloudEvent specification, which defines standard metadata for events. It 
> defines a very generic format for an event that allows storage of almost any 
> data. It seems to me that by going in that direction it's losing almost all 
> the advantages of using Avro in the first place. It feels like it's trying to 
> shoehorn a dynamic message format like JSON into the Avro format, where using 
> Avro itself could do so much better.
>
> I'm hoping to propose something better. I had what I thought was a nice idea, 
> but it doesn't quite work, and I thought I'd bring up the subject here and 
> see if anyone had some better ideas.
>
> The schema resolution part of the spec allows a reader to read a schema that 
> was written with extra fields. So, theoretically, we could define a 
> CloudEvent something like this:
>
> { "name": "CloudEvent", "type": "record", "fields": [{ "name": "Metadata", 
> "type": { "type": "record", "name": "CloudEvent", "namespace": 
> "avro.apache.org", "fields": [{ "name": "id", "type": "string" }, { "name": 
> "source", "type": "string" }, { "name": "time", "type": "long", 
> "logicalType": "timestamp-micros" }] } }] }
>
> Theoretically, this could enable any event that's a record that has at least 
> a Metadata field with the above fields to be read generically. The CloudEvent 
> type above could be seen as a structural supertype of all possible 
> more-specific CloudEvent-compatible records that have such a compatible field.
>
> This has a few nice advantages:
> - there's no need for any wrapping of payload data.
> - the CloudEvent type can evolve over time like any other Avro type.
> - all the data message fields are immediately available alongside the 
> metadata.
> - there's still exactly one schema for a topic, encapsulating both the 
> metadata and the payload.
>
> However, this idea fails because of one problem - this schema resolution 
> rule: "both schemas are records with the same (unqualified) name". This means 
> that unless everyone names all their CloudEvent-compatible records 
> "CloudEvent", they can't be read like this.
>
> I don't think people will be willing to name all their records "CloudEvent", 
> so we have a problem.
>
> I can see a few possible workarounds:
>
> when reading the record as a CloudEvent, read it with a schema that's the 
> same as CloudEvent, but with the top level record name changed to the top 
> level name of the schema that was used to write the record.
> ignore record names when matching schema record types.
> allow aliases to be specified when writing data as well as reading it. When 
> defining a CloudEvent-compatible event, you'd add a CloudEvent alias to your 
> record.
>
> None of the options are particularly nice. 1 is probably the easiest to do, 
> although means you'd still need some custom logic when decoding records, 
> meaning you couldn't use stock decoders.
>
> I like the idea of 2, although it gets a bit tricky when dealing with union 
> types. You could define the matching such that it ignores names only when the 
> two matched types are unambiguous (i.e. only one record in both). This could 
> be implemented as an option ("use structural typing") when decoding.
>
> 3 is probably cleanest but interacts significantly with the spec (for 
> example, the canonical schema transformation strips aliases out, but they'd 
> need to be retained).
>
> Any thoughts? Is this a silly thing to be contemplating? Is there a better 
> way?
>
>   cheers,
> rog.
>


Re: New Committer: Ryan Skraba

2019-12-17 Thread Ryan Skraba
Thanks so much!  I'm super impressed with the quality of the work and
advancement I've seen here, and I'm pretty excited and grateful to be
able to contribute!

Ryan


On Tue, Dec 17, 2019 at 1:21 PM Austin Cawley-Edwards
 wrote:
>
> Congrats Ryan, thanks for the help so far!
>
>
> Austin
>
> On Tue, Dec 17, 2019 at 7:13 AM Michael Burr  wrote:
>>
>> unsubscribe
>>
>> On Tue, Dec 17, 2019 at 4:43 AM Driesprong, Fokko  
>> wrote:
>>>
>>> Folks,
>>>
>>> The Project Management Committee (PMC) for Apache Avro has invited Ryan 
>>> Skraba to become a committer and we are pleased to announce that he has 
>>> accepted. Ryan is actively fixing bugs by providing patches and reviewing 
>>> pull requests by others. We're very happy to have him on board.
>>>
>>> Being a committer enables easier contribution to the project since there is 
>>> no need to go via the patch submission process. This should enable better 
>>> productivity.
>>>
>>> Please join me in congratulating Ryan on his recognition of great work thus 
>>> far in our community.
>>>
>>> Cheers, Fokko


Re: records with without fields?

2019-12-17 Thread Ryan Skraba
Related to the earlier question in the thread: there's one good
starting point for a language-agnostic set of test schemas here:
https://github.com/apache/avro/blob/master/share/test/data/schema-tests.txt#L24

There are a LOT of other schemas scattered throughout the project and
languages, of course. It would be super (but probably a bit magical)
if there were a common reference that could be reused across all
languages!  From the user/community point of view, I would especially
find it interesting to have these durable, linkable examples in the
docs to use as references.

On Sun, Dec 15, 2019 at 12:54 PM Vance Duncan  wrote:
>
> Yes, you’re right. It’s true of any record. It’s rare I end up with no 
> required field for a record. The idea of a metadata container - such as a 
> custom “extension” record to a base record - does sound like a meaningful use 
> case.
>
> On Sat, Dec 14, 2019 at 3:50 AM roger peppe  wrote:
>>
>> On Sat, 14 Dec 2019 at 04:58, Vance Duncan  wrote:
>>>
>>> Because you will forever be limited to adding nullable fields to that 
>>> record. It will forever be a weak contract. That may be OK, depending on 
>>> the situation. You just won’t be able to enforce semantics through the 
>>> schema. The code will have to enforce all constraints.
>>
>>
>> As I understand it, that's true when adding fields to any record over time, 
>> whether the record starts off with ten fields or none.
>>
>> As you say, it may be OK in some situations, so ISTM that it shouldn't be 
>> forbidden by the specification.
>>
>> One possible concrete use case for an empty record is to reserve a field in 
>> a record for future extensible use; for metadata, for example.
>> This could be somewhat nicer than using a map type because you get the 
>> capability to add specific fields.
>> Perhaps the OCF format could have done something like this for the file 
>> metadata field.
>>
>>   cheers,
>> rog.
>>
>>>
>>> On Fri, Dec 13, 2019 at 6:28 PM roger peppe  wrote:



 On Fri, 13 Dec 2019 at 23:08, Vance Duncan  wrote:
>
> Sorry about that. I was assuming some kind of name-based schema registry 
> lookup. Assume you are looking up schemas by name using a schema 
> registry. Let’s say the record is name MyRecord. You subsequently add a 
> required field to it. Since the new record is not reverse compatible, 
> you’ll need to name it MyRecord2, or whatever. This is what I meant by 
> “reidentify”.


 I don't quite get how this is different to having a struct with any other 
 number of fields. Why should zero be special here?

>
> On Fri, Dec 13, 2019 at 12:46 PM roger peppe  wrote:
>>
>>
>>
>> On Fri, 13 Dec 2019 at 15:02, Vance Duncan  wrote:
>>>
>>> My immediate thought is observe the YAGNI principle and only create it 
>>> if and when you need it. Otherwise, you run the risk of requiring 
>>> non-interchangeable re-identification if you need required, 
>>> non-default, fields when the need materializes.
>>
>>
>> Could you expand a little on that latter point, please? I'm not sure I 
>> understand what you're saying.
>> A concrete example might help.
>>
>>   cheers,
>> rog.
>>>
>>>
>>>
>>> On December 13, 2019, at 9:25 AM, roger peppe  
>>> wrote:
>>>
>>>
>>> Hi,
>>>
>>> The specification doesn't seem to make it entirely clear whether it's 
>>> allowable for a record to contain no fields (a zero-length array for 
>>> the fields member). I've found at least one implementation that 
>>> complains about a record with an empty fields array, and I'm wondering 
>>> if this is a bug.
>>>
>>> A record containing no fields is actually quite useful as it can act as 
>>> a placeholder for a record with any number of extra fields in future 
>>> evolutions of a schema.
>>>
>>> What do you think?
>>>
>>>   cheers,
>>> rog.
>
> --
> Regards,
>
> Vance Duncan
> mailto:dunca...@gmail.com
> http://www.linkedin.com/in/VanceDuncan
> (904) 553-5582
>>>
>>> --
>>> Regards,
>>>
>>> Vance Duncan
>>> mailto:dunca...@gmail.com
>>> http://www.linkedin.com/in/VanceDuncan
>>> (904) 553-5582
>
> --
> Regards,
>
> Vance Duncan
> mailto:dunca...@gmail.com
> http://www.linkedin.com/in/VanceDuncan
> (904) 553-5582


Re: records with without fields?

2019-12-13 Thread Ryan Skraba
I think the spec is OK with it.  We've even used it in the Java API
(to refer to a table that had been created but had no columns yet).  It's
not *extremely* useful even as a starting point to add schema
evolutions, but maybe as a technique for forcing different Parsing
Canonical Forms for otherwise identical schemas?  The no-field record
wouldn't be stripped, but still serializes down to zero binary bytes.

We actually ran into the following problem: how many records (of size
zero) can you decode from an empty stream of bytes?   If I remember
correctly, the Java API will happily read zero-byte records forever,
so if you're going to use this technique, make sure you have a
stopping condition!
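
A contrived sketch of that pitfall (the empty schema and the loop bound here are mine, not from the original discussion):

Schema empty = SchemaBuilder.record("Empty").fields().endRecord();
GenericDatumReader<GenericRecord> reader = new GenericDatumReader<>(empty);
BinaryDecoder decoder = DecoderFactory.get().binaryDecoder(new byte[0], null);
// Every read consumes zero bytes and "succeeds", so the loop must be bounded by the caller:
for (int i = 0; i < 3; i++) {
  GenericRecord r = reader.read(null, decoder);
}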

Ryan

On Fri, Dec 13, 2019 at 4:02 PM Vance Duncan  wrote:
>
> My immediate thought is observe the YAGNI principle and only create it if and 
> when you need it. Otherwise, you run the risk of requiring 
> non-interchangeable re-identification if you need required, non-default, 
> fields when the need materializes.
>
>
>
> On December 13, 2019, at 9:25 AM, roger peppe  wrote:
>
>
> Hi,
>
> The specification doesn't seem to make it entirely clear whether it's 
> allowable for a record to contain no fields (a zero-length array for the 
> fields member). I've found at least one implementation that complains about a 
> record with an empty fields array, and I'm wondering if this is a bug.
>
> A record containing no fields is actually quite useful as it can act as a 
> placeholder for a record with any number of extra fields in future evolutions 
> of a schema.
>
> What do you think?
>
>   cheers,
> rog.


Re: Resolving a possible specification inconsistency pertaining to the doc attribute

2019-12-10 Thread Ryan Skraba
@Roger: The CUE schema gets a +1 for the most accurate regex for
validating names and namespaces so far! :D  It doesn't look like it's
being applied to *every* name and namespace attribute though, or am I
misreading?  I read the schema with just a *minimal* understanding of
the language, but it looks like it also expects that fixed data can
have a doc.

I would hope that the doc attribute in a fixed data schema could still
be retrieved like any other metadata by schema.getObjectProp (at least
in the Java API).  I'll check!
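
Something like this is what I'd check (a sketch only -- whether the doc attribute survives as a retrievable property on fixed is exactly the open question):

Schema md5 = new Schema.Parser().parse(
    "{\"type\":\"fixed\",\"name\":\"MD5\",\"size\":16,\"doc\":\"128-bit hash\"}");
Object doc = md5.getObjectProp("doc");  // may well be null if "doc" is treated as reserved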

@Jonah: I think I understand your use case a bit better -- thanks for
the clarification!

Attributes outside of the spec should be OK to use as metadata, and
that seems like the right fit for your use case (such as the
interesting obfuscation attribute in lenses).  Are the avro tools that
strip non-spec-attributes/metadata doing something wrong?  I can see
this happening if they are relying on the Parsing Canonical Form or
the fingerprint (based on canonical form), but that is deliberate to
remove all differences between two schemas that can be used to parse
the same binary data.  Note that PCF also removes doc attributes.

Is there code in the avro project that is manipulating schemas and
stripping metadata silently?  I would consider that a bug.  For
external tools, it could either be a bug or undocumented behaviour.

All my best, Ryan

On Mon, Dec 9, 2019 at 5:14 PM roger peppe  wrote:
>
> Somewhat relevant, here is a CUE schema for Avro schemas that I wrote a 
> little while ago that can be used to check Avro schema compliance to a degree 
> (if you haven't heard of CUE, there's a bunch of info on it at cuelang.org).
>
> My understanding of Avro was somewhat less back then, so it's probably wrong in 
> parts, and it's definitely not as strict as it could be, but I've found it 
> useful, and it has lots of room for improvement.
>
>   cheers,
> rog.
>
>
>
> On Fri, 6 Dec 2019 at 17:43, Jonah H. Harris  wrote:
>>
>> On Fri, Dec 6, 2019 at 12:16 PM Ryan Skraba  wrote:
>>>
>>> Hello!  Yes, it looks like `fixed` is the only named complex type that
>>> doesn't have a doc attribute.  No primitive types have the doc
>>> attribute.
>>>
>>> This might be an omission, but I don't think it's inconsistent.  In my
>>> experience, there's no compelling reason to document schemas of
>>> primitive types, but a good practice for the fields or container types
>>> that they're inside.  Fixed is not a primitive type, but in practice
>>> it's used like bytes (which is).
>>
>>
>> Hey, Ryan. Thanks for getting back to me so quickly.
>>
>> Yeah. I don't think primitive types need the doc attribute. As fixed is 
>> complex and can be an independent type, however, I thought that was 
>> inconsistent with the other complex types.
>>
>>>
>>> In my opinion, I wouldn't consider it important to make the doc
>>> attribute universal on any type/field, but I wouldn't have any strong
>>> objection if that were the consensus.  Today, I'm pretty sure that the
>>> Java implementation corresponds to the spec with regards to the doc
>>> attribute.
>>
>>
>> Agreed.
>>
>>>
>>> As a minimum, I'd propose that the only action here is to change the
>>> IDL guide: "Comments that begin with /** are used as the documentation
>>> string (if applicable) for the type or field definition that follows
>>> the comment."
>>>
>>> Is this what you're looking for?
>>
>>
>> Yes. We're actually using the doc string to store not only a textual 
>> description of the field/type, but also a set of annotations used for event 
>> storage and data masking. The main reason we wanted doc to be consistent for 
>> all complex types (including fixed) is that it permits us to easily tell 
>> what complex objects can exist across the ecosystem directly from our schema 
>> repository. Initially, we wanted to use a separate internal attribute 
>> (similar to the lenses obfuscate attribute approach -- 
>> https://docs.lenses.io/2.0/install_setup/datagovernance/index.html#data-anonymization 
>> -- but we've found several Avro tools strip out all non-spec-compliant 
>> attributes). This leaves us only the doc field.
>>
>>> P.S. I'm very intrigued by the "thorough schema compliance checker"!
>>> Is this something that would be shared? Would it help find other
>>> inconsistencies in the Avro spec and implementations?
>>
>>
>> Yes, this will be open-sourced.
>>
>> --
>> Jonah H. Harris
>>


Re: defaults for complex types (was Re: recursive types)

2019-12-06 Thread Ryan Skraba
Hello!   I had a Java unit test ready to go (looking at default values
for complex types for AVRO-2636), so just reporting back (the easy
work!):

1. In Java, the schema above is parsed without error, but when
attempting to use the default value, it fails with a
NullPointerException (trying to find the symbol C in E1).
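
(For reference, the failure in case 1 shows up when the default is actually materialized, roughly like this -- schemaJson being the schema from the quoted mail below:)

Schema r = new Schema.Parser().parse(schemaJson);                 // parses without error
Object def = GenericData.get().getDefaultValue(r.getField("F"));  // NullPointerException here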

2. If you were to disambiguate the symbols using the Avro JSON
encoding ("default": [{"E1":"B"},{"E2":"A"},{"E2":"C"}]), Java fails
while parsing the schema:

org.apache.avro.AvroTypeException: Invalid default for field F:
[{"E1":"B"},{"E2":"A"},{"E2":"C"}] not a
{"type":"array","items":[{"type":"enum","name":"E1","symbols":["A","B"]},{"type":"enum","name":"E2","symbols":["B","A","C"]}]}
at org.apache.avro.Schema.validateDefault(Schema.java:1542)
at org.apache.avro.Schema.access$500(Schema.java:87)
at org.apache.avro.Schema$Field.(Schema.java:523)
at org.apache.avro.Schema.parse(Schema.java:1649)
at org.apache.avro.Schema$Parser.parse(Schema.java:1396)
at org.apache.avro.Schema$Parser.parse(Schema.java:1384)

It seems that Java implements `Only the first schema in any union can
be used in a default value` as opposed to `Default values for union
fields correspond to the first schema in the union` (in the example,
it isn't a union field).

Naively, I would expect any JSON encoded data to be a valid default
value (which is not what the spec says).  Does anyone know why the
"first schema only" rule was added to the spec?

Best regards, Ryan



On Thu, Dec 5, 2019 at 7:01 PM Lee Hambley  wrote:
>
> Hi Rog,
>
> Glad my pointers were useful, the Avro spec really is a marvel.
>
> Regarding your follow-up question, I'm honestly not sure, interesting 
> contrived example however, and interesting that no matter how well written 
> the spec is, it can still be ambiguous.
>
> I found this snippet in the 1.9.x docs, where I know there were some changes to 
> defaults for complex types; the 1.8 docs may be incomplete in that regard. ( 
> https://avro.apache.org/docs/1.9.0/spec.html#schema_complex )
>
>> Default values for union fields correspond to the first schema in the union. 
>> Default values for bytes and fixed fields are JSON strings, where Unicode 
>> code points 0-255 are mapped to unsigned 8-bit byte values 0-255.
>
>
> I take `Default values for union fields correspond to the first schema in the 
> union` to mean that your default including values from the 2nd schema in the 
> union is invalid, *or* that where the member exists in the first union it 
> refers to the first union, and when not, it refers to the first schema in 
> which it _does_ exist.
>
> One way to find out would be to run some data through a couple of common 
> implementations, and see how they handle the resulting data, and, maybe feed 
> that back into Avro docs in the form of a PR if you come up with something 
> useful?
>
> Either way, I'm curious now! Let me know when you have an answer?
>
> Cheers,
>
> Lee Hambley
> http://lee.hambley.name/
> +49 (0) 170 298 5667
>
>
> On Thu, 5 Dec 2019 at 14:07, roger peppe  wrote:
>>
>> On Wed, 4 Dec 2019 at 11:38, Lee Hambley  wrote:
>>>
>>> HI Rog,
>>>
>>> Good question, the answer lay in the docs in the "Parsing Canonical Form 
>>> for Schemas" where it states (amongst all the other transformation rules)
>>>
 [ORDER] Order the appearance of fields of JSON objects as follows: name, 
 type, fields, symbols, items, values, size. For example, if an object has 
 type, name, and size fields, then the name field should appear first, 
 followed by the type and then the size fields.
>>>
>>>
>>> (emphasis mine)
>>>
>>> The canonical form for schemas becomes more relevant to Avro usage when 
>>> working with a schema registry for e.g, but it's a really common use-case 
>>> and I consider definition of a canonical form for schema comparisons to be 
>>> a strength of Avro compared with other serialization formats.
>>>
>>> - 
>>> https://avro.apache.org/docs/1.8.2/spec.html#Parsing+Canonical+Form+for+Schemas
>>
>>
>> Thanks very much - I'd missed that, very helpful!
>>
>> Maybe you might be able to help with another part of the spec that I've been 
>> puzzling over too: default values for complex types.
>> The spec doesn't seem to say how unions in complex types are specified when 
>> in default values.
>>
>> For example, consider the following schema:
>>
>> {
>> "type": "record",
>> "name": "R",
>> "fields": [
>> {
>> "name": "F",
>> "type": {
>> "type": "array",
>> "items": [
>> {
>> "type": "enum",
>> "name": "E1",
>> "symbols": ["A", "B"]
>> },
>> {
>> "type": "enum",
>> "name": "E2",
>> "symbols": ["B", "A", "C"]
>> }
>> ]
>> },
>> "default": 

Re: Avro schema having Map of Records

2019-08-06 Thread Ryan Skraba
Funny, I'm familiar with Avro, but I'm currently looking closely at Parquet!

Interestingly enough, I just ran across the conversion utilities in
Spark that could have answered your original question[1].

It looks like you're using ReflectData to get the schema.  Is the
exception occurring during the ReflectData.getSchema() or .induce() ?
Can you share the full stack trace or better yet, the POJO that
reproduces the error?

I _think_ I may have run across something similar when getting a
schema via reflection, but the class had a raw collection field (List
instead of a parameterized List<T>).  I can't clearly recall, but that
might be a useful hint.

[1]: 
https://github.com/apache/spark/blob/master/external/avro/src/main/scala/org/apache/spark/sql/avro/SchemaConverters.scala#L136
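For what it's worth, here is a rough sketch of the reflection route with fully
parameterized collection fields; the POJO names below are hypothetical, not
taken from your code:

import java.util.List;
import java.util.Map;
import org.apache.avro.Schema;
import org.apache.avro.reflect.ReflectData;

public class ReflectSchemaDemo {
  // Hypothetical POJOs: every collection field declares its type parameters,
  // otherwise ReflectData cannot find the element type.
  public static class Sample {
    public String sample1;
    public List<String> sample2;
  }

  public static class OneLevel {
    public Map<String, Sample> innerLevel;
  }

  public static class MyClass {
    public OneLevel oneLevel;
  }

  public static void main(String[] args) {
    // Map<String, Sample> maps to an Avro map whose values are the "Sample" record.
    Schema schema = ReflectData.get().getSchema(MyClass.class);
    System.out.println(schema.toString(true));
  }
}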

On Tue, Aug 6, 2019 at 2:39 PM Edgar H  wrote:
>
> Thanks a lot for the quick reply Ryan! That was exactly what I was looking 
> for :)
>
> I've been trying to include the changes within my code and currently it's throwing 
> the following exception... Caused by: org.apache.avro.AvroRuntimeException: 
> Can't find element type of Collection
>
> I'm thinking that it could be the POJO not containing the classes for the 
> inner record fields (I just have a getter and setter for the one_level field, 
> but the rest are types nested within that one)? Or how should it be represented within 
> the parent POJO?
>
> Sorry if the questions sound too simple, but I'm so used to working with 
> Parquet that Avro seems like a shift from time to time :)
>
> On Tue, Aug 6, 2019 at 12:01, Ryan Skraba () wrote:
>>
>> Hello -- Avro supports a map type:
>> https://avro.apache.org/docs/1.9.0/spec.html#Maps
>>
>> Generating an Avro schema from a JSON example can be ambiguous since a
>> JSON object can either be converted to a record or a map.  You're
>> probably looking for something like this:
>>
>> {
>>   "type" : "record",
>>   "name" : "MyClass",
>>   "namespace" : "com.acme.avro",
>>   "fields" : [ {
>> "name" : "one_level",
>> "type" : {
>>   "type" : "record",
>>   "name" : "one_level",
>>   "fields" : [ {
>> "name" : "inner_level",
>> "type" : {
>>   "type" : "map",
>>   "values" : {
>> "type" : "record",
>> "name" : "sample",
>> "fields" : [ {
>>   "name" : "sample1",
>>   "type" : "string"
>> }, {
>>   "name" : "sample2",
>>   "type" : "string"
>> } ]
>>   }
>> }
>>   } ]
>> }
>>   } ]
>> }
>>
>> On Tue, Aug 6, 2019 at 10:47 AM Edgar H  wrote:
>> >
>> > I'm trying to translate a schema that I have in Spark which is defined for 
>> > Parquet, and I would like to use it within Avro too.
>> >
>> >   StructField("one_level", StructType(List(StructField(
>> > "inner_level",
>> > MapType(
>> >   StringType,
>> >   StructType(
>> > List(
>> >   StructField("field1", StringType),
>> >   StructField("field2", ArrayType(StringType))
>> > )
>> >   )
>> > )
>> >   )
>> > )), nullable = false)
>> >
>> > However, in Avro I haven't seen any examples of Maps containing Record 
>> > type objects...
>> >
>> > Tried a sample input with an online Avro schema generator, taking this 
>> > input.
>> >
>> > {
>> >   "one_level": {
>> >     "inner_level": {
>> >       "sample1": {
>> >         "field1": "sample",
>> >         "field2": ["a", "b"]
>> >       },
>> >       "sample2": {
>> >         "field1": "sample2",
>> >         "field2": ["a", "b"]
>> >       }
>> >     }
>> >   }
>> > }
>> >
>> > It produces this output.
>> >
>> > {
>> >   "name": "MyClass",
>> >   "type": "record",
>> >  

Re: Avro schema having Map of Records

2019-08-06 Thread Ryan Skraba
Hello -- Avro supports a map type:
https://avro.apache.org/docs/1.9.0/spec.html#Maps

Generating an Avro schema from a JSON example can be ambiguous since a
JSON object can either be converted to a record or a map.  You're
probably looking for something like this:

{
  "type" : "record",
  "name" : "MyClass",
  "namespace" : "com.acme.avro",
  "fields" : [ {
"name" : "one_level",
"type" : {
  "type" : "record",
  "name" : "one_level",
  "fields" : [ {
"name" : "inner_level",
"type" : {
  "type" : "map",
  "values" : {
"type" : "record",
"name" : "sample",
"fields" : [ {
  "name" : "sample1",
  "type" : "string"
}, {
  "name" : "sample2",
  "type" : "string"
} ]
  }
}
  } ]
}
  } ]
}
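If it helps to see the schema in use, here is a rough sketch of populating it
through the generic API (assuming the schema above is available as a JSON
string):

import java.util.HashMap;
import java.util.Map;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;

public class MapOfRecordsDemo {
  public static GenericRecord build(String schemaJson) {
    Schema root = new Schema.Parser().parse(schemaJson);
    Schema oneLevelSchema = root.getField("one_level").schema();
    // The map's value schema is the nested "sample" record.
    Schema sampleSchema =
        oneLevelSchema.getField("inner_level").schema().getValueType();

    GenericRecord sample = new GenericData.Record(sampleSchema);
    sample.put("sample1", "a value");
    sample.put("sample2", "another value");

    Map<String, GenericRecord> innerLevel = new HashMap<>();
    innerLevel.put("some-key", sample);

    GenericRecord oneLevel = new GenericData.Record(oneLevelSchema);
    oneLevel.put("inner_level", innerLevel);

    GenericRecord record = new GenericData.Record(root);
    record.put("one_level", oneLevel);
    return record;
  }
}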

On Tue, Aug 6, 2019 at 10:47 AM Edgar H  wrote:
>
> I'm trying to translate a schema that I have in Spark which is defined for 
> Parquet, and I would like to use it within Avro too.
>
>   StructField("one_level", StructType(List(StructField(
> "inner_level",
> MapType(
>   StringType,
>   StructType(
> List(
>   StructField("field1", StringType),
>   StructField("field2", ArrayType(StringType))
> )
>   )
> )
>   )
> )), nullable = false)
>
> However, in Avro I haven't seen any examples of Maps containing Record type 
> objects...
>
> Tried a sample input with an online Avro schema generator, taking this input.
>
> {
>   "one_level": {
>     "inner_level": {
>       "sample1": {
>         "field1": "sample",
>         "field2": ["a", "b"]
>       },
>       "sample2": {
>         "field1": "sample2",
>         "field2": ["a", "b"]
>       }
>     }
>   }
> }
>
> It produces this output.
>
> {
>   "name": "MyClass",
>   "type": "record",
>   "namespace": "com.acme.avro",
>   "fields": [
> {
>   "name": "one_level",
>   "type": {
> "name": "one_level",
> "type": "record",
> "fields": [
>   {
> "name": "inner_level",
> "type": {
>   "name": "inner_level",
>   "type": "record",
>   "fields": [
> {
>   "name": "sample1",
>   "type": {
> "name": "sample1",
> "type": "record",
> "fields": [
>   {
> "name": "field1",
> "type": "string"
>   },
>   {
> "name": "field2",
> "type": {
>   "type": "array",
>   "items": "string"
> }
>   }
> ]
>   }
> },
> {
>   "name": "sample2",
>   "type": {
> "name": "sample2",
> "type": "record",
> "fields": [
>   {
> "name": "field1",
> "type": "string"
>   },
>   {
> "name": "field2",
> "type": {
>   "type": "array",
>   "items": "string"
> }
>   }
> ]
>   }
> }
>   ]
> }
>   }
> ]
>   }
> }
>   ]
> }
>
> Which isn't exactly what I'm looking for. Is it possible to define such a 
> schema in Avro?


Re: AVRO schema evolution: adding optional column with default fails deserialization

2019-08-02 Thread Ryan Skraba
hus using the schema 
>>>>> registry again you can get the writer schema.
>>>>>
>>>>> /Svante
>>>>>
>>>>> On Thu, Aug 1, 2019, 15:30 Martin Mucha  wrote:
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> just one more question, not strictly related to the subject.
>>>>>>
>>>>>> Initially I thought I'd be OK with using some initial version of the schema 
>>>>>> in place of the writer schema. That works, but all columns from schemas older 
>>>>>> than this initial one would just be ignored. So I need to know EXACTLY 
>>>>>> the schema which the writer used. I know that Avro messages contain 
>>>>>> either the full schema or at least its ID. Can you point me to the 
>>>>>> documentation where this is discussed? So in my deserializer I have 
>>>>>> a byte[] as input, from which I need to get the schema information 
>>>>>> first, in order to be able to deserialize the record. I really do not 
>>>>>> know how to do that; I'm pretty sure I never saw this anywhere, and I 
>>>>>> cannot find it anywhere. But in principle it must be possible, since 
>>>>>> the reader does not necessarily have any control over which schema the writer used.
>>>>>>
>>>>>> thanks a lot.
>>>>>> M.
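As an aside for anyone with the same question: besides the registry-specific
framings, the Avro spec itself defines a "single-object encoding" that prefixes
each message with a fingerprint of the writer schema, and the Java SDK
implements it in org.apache.avro.message. A rough sketch (the record and schema
here are placeholders):

import java.nio.ByteBuffer;
import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.generic.GenericRecordBuilder;
import org.apache.avro.message.BinaryMessageDecoder;
import org.apache.avro.message.BinaryMessageEncoder;
import org.apache.avro.message.SchemaStore;

public class SingleObjectDemo {
  public static void main(String[] args) throws Exception {
    Schema writerSchema = SchemaBuilder.record("Simple").fields()
        .requiredInt("id").requiredString("name").endRecord();

    GenericRecord datum = new GenericRecordBuilder(writerSchema)
        .set("id", 1).set("name", "one").build();

    // The encoded bytes start with a marker plus the writer schema's fingerprint.
    BinaryMessageEncoder<GenericRecord> encoder =
        new BinaryMessageEncoder<>(GenericData.get(), writerSchema);
    ByteBuffer bytes = encoder.encode(datum);

    // The decoder resolves the fingerprint back to a schema via a SchemaStore.
    SchemaStore.Cache store = new SchemaStore.Cache();
    store.addSchema(writerSchema);
    BinaryMessageDecoder<GenericRecord> decoder =
        new BinaryMessageDecoder<>(GenericData.get(), writerSchema, store);
    System.out.println(decoder.decode(bytes));
  }
}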
>>>>>>
>>>>>> On Tue, Jul 30, 2019 at 18:16, Martin Mucha  wrote:
>>>>>>>
>>>>>>> Thank you very much for in depth answer. I understand how it works now 
>>>>>>> better, will test it shortly.
>>>>>>> Thank you for your time.
>>>>>>>
>>>>>>> Martin.
>>>>>>>
>>>>>>> On Tue, Jul 30, 2019 at 17:09, Ryan Skraba  wrote:
>>>>>>>>
>>>>>>>> Hello!  It's the same issue in your example code as with allegro, even with
>>>>>>>> the SpecificDatumReader.
>>>>>>>>
>>>>>>>> This line: datumReader = new SpecificDatumReader<>(schema)
>>>>>>>> should be: datumReader = new SpecificDatumReader<>(originalSchema, 
>>>>>>>> schema)
>>>>>>>>
>>>>>>>> In Avro, the original schema is commonly known as the writer schema
>>>>>>>> (the instance that originally wrote the binary data).  Schema
>>>>>>>> evolution applies when you are using the constructor of the
>>>>>>>> SpecificDatumReader that takes *both* reader and writer schemas.
>>>>>>>>
>>>>>>>> As a concrete example, if your original schema was:
>>>>>>>>
>>>>>>>> {
>>>>>>>>   "type": "record",
>>>>>>>>   "name": "Simple",
>>>>>>>>   "fields": [
>>>>>>>> {"name": "id", "type": "int"},
>>>>>>>> {"name": "name","type": "string"}
>>>>>>>>   ]
>>>>>>>> }
>>>>>>>>
>>>>>>>> And you added a field:
>>>>>>>>
>>>>>>>> {
>>>>>>>>   "type": "record",
>>>>>>>>   "name": "SimpleV2",
>>>>>>>>   "fields": [
>>>>>>>> {"name": "id", "type": "int"},
>>>>>>>> {"name": "name", "type": "string"},
>>>>>>>> {"name": "description","type": ["null", "string"]}
>>>>>>>>   ]
>>>>>>>> }
>>>>>>>>
>>>>>>>> You could do the following safely, assuming that Simple and SimpleV2
>>>>>>>> classes are generated from the avro-maven-plugin:
>>>>>>>>
>>>>>>>> @Test
>>>>>>>> public void testSerializeDeserializeEvolution() throws IOException {
>>>>>>>>   // Write a Simple v1 to bytes using your exact method.
>>

Re: AVRO schema evolution: adding optional column with default fails deserialization

2019-07-30 Thread Ryan Skraba
> throw new SerializationException("Error deserializing data", ex);
> }
> }
>
> serializer:
> public static <T extends GenericContainer> byte[] serialize(T data, boolean 
> useBinaryDecoder, boolean pretty) {
> try {
> if (data == null) {
> return new byte[0];
> }
>
> log.debug("data='{}'", data);
> Schema schema = data.getSchema();
> ByteArrayOutputStream byteArrayOutputStream = new 
> ByteArrayOutputStream();
> Encoder binaryEncoder = useBinaryDecoder
> ? 
> EncoderFactory.get().binaryEncoder(byteArrayOutputStream, null)
> : EncoderFactory.get().jsonEncoder(schema, 
> byteArrayOutputStream, pretty);
>
> DatumWriter<T> datumWriter = new 
> GenericDatumWriter<>(schema);
> datumWriter.write(data, binaryEncoder);
>
> binaryEncoder.flush();
> byteArrayOutputStream.close();
>
> byte[] result = byteArrayOutputStream.toByteArray();
> log.debug("serialized data='{}'", 
> DatatypeConverter.printHexBinary(result));
> return result;
> } catch (IOException ex) {
> throw new SerializationException(
> "Can't serialize data='" + data, ex);
> }
> }
>
> út 30. 7. 2019 v 13:48 odesílatel Ryan Skraba  napsal:
>>
>> Hello!  Schema evolution relies on both the writer and reader schemas
>> being available.
>>
>> It looks like the allegro tool you are using relies on a 
>> GenericDatumReader that assumes the reader and writer schemas are the 
>> same:
>>
>> https://github.com/allegro/json-avro-converter/blob/json-avro-converter-0.2.8/converter/src/main/java/tech/allegro/schema/json2avro/converter/JsonAvroConverter.java#L83
>>
>> I do not believe that the "default" value is taken into account for 
>> data that is strictly missing from the binary input, only when a field 
>> is known to be in the reader schema but missing from the original 
>> writer's schema.
>>
>> You may have more luck reading the GenericRecord with a
>> GenericDatumReader with both schemas, and using the
>> `convertToJson(record)`.
>>
>> I hope this is useful -- Ryan
>>
>>
>>
>> On Tue, Jul 30, 2019 at 10:20 AM Martin Mucha  wrote:
>> >
>> > Hi,
>> >
>> > I've got some issues/misunderstanding of AVRO schema evolution.
>> >
>> > When reading through the Avro documentation, for example [1], I understood 
>> > that schema evolution is supported, and that if I added a column with a specified 
>> > default, it should be backwards compatible (and even forward compatible when I remove 
>> > it again). Sounds great, so I added a column defined as:
>> >
>> > {
>> >   "name": "newColumn",
>> >   "type": ["null","string"],
>> >   "default": null,
>> >   "doc": "something wrong"
>> > }
>> >
>> > and when I try to consume some topic having this schema from the beginning, it fails 
>> > with this message:
>> >
>> > Caused by: java.lang.ArrayIndexOutOfBoundsException: 5
>> > at 
>> > org.apache.avro.io.parsing.Symbol$Alternative.getSymbol(Symbol.java:424)
>> > at 
>> > org.apache.avro.io.ResolvingDecoder.doAction(ResolvingDecoder.java:290)
>> > at org.apache.avro.io.parsing.Parser.advance(Parser.java:88)
>> > at 
>> > org.apache.avro.io.ResolvingDecoder.readIndex(ResolvingDecoder.java:267)
>> > at 
>> > org.apache.avro.generic.GenericDatumReader.readWithoutConversion(GenericDatumReader.java:179)
>> > at 
>> > org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:153)
>> > at 
>> > org.apache.avro.generic.GenericDatumReader.readField(GenericDatumReader.java:232)
>> > at 
>> > org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:222)
>> > at 
>> > org.apache.avro.generic.GenericDatumReader.readWithoutConversion(GenericDatumReader.java:175)
>> > at 
>> > org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:153)
>> > at 
>> > org.apache.avro.generic.GenericDatumReader.readWithoutConversion(GenericDatumReader.java:179)
>> > at 
>> > org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:153)
>> > at 
>> 

Re: AVRO schema evolution: adding optional column with default fails deserialization

2019-07-30 Thread Ryan Skraba
Hello!  Schema evolution relies on both the writer and reader schemas
being available.

It looks like the allegro tool you are using relies on a
GenericDatumReader that assumes the reader and writer schemas are the
same:

https://github.com/allegro/json-avro-converter/blob/json-avro-converter-0.2.8/converter/src/main/java/tech/allegro/schema/json2avro/converter/JsonAvroConverter.java#L83

I do not believe that the "default" value is taken into account for
data that is strictly missing from the binary input, only when a field
is known to be in the reader schema but missing from the original
writer's schema.

You may have more luck reading the GenericRecord with a
GenericDatumReader with both schemas, and using the
`convertToJson(record)`.
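A rough sketch of that suggestion, using a pair of throwaway schemas shaped
like the ones in this thread (an old writer schema and a new reader schema that
adds the defaulted field):

import java.io.ByteArrayOutputStream;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.Decoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.EncoderFactory;

public class EvolutionDemo {
  public static void main(String[] args) throws Exception {
    Schema writerSchema = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"Simple\",\"fields\":["
            + "{\"name\":\"id\",\"type\":\"int\"}]}");
    Schema readerSchema = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"Simple\",\"fields\":["
            + "{\"name\":\"id\",\"type\":\"int\"},"
            + "{\"name\":\"newColumn\",\"type\":[\"null\",\"string\"],\"default\":null}]}");

    // Write with the old (writer) schema only.
    GenericRecord datum = new GenericData.Record(writerSchema);
    datum.put("id", 42);
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
    new GenericDatumWriter<GenericRecord>(writerSchema).write(datum, encoder);
    encoder.flush();

    // Read with BOTH schemas: the missing field comes back as its default (null).
    Decoder decoder = DecoderFactory.get().binaryDecoder(out.toByteArray(), null);
    GenericRecord resolved =
        new GenericDatumReader<GenericRecord>(writerSchema, readerSchema).read(null, decoder);
    System.out.println(resolved); // {"id": 42, "newColumn": null}
  }
}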

I hope this is useful -- Ryan



On Tue, Jul 30, 2019 at 10:20 AM Martin Mucha  wrote:
>
> Hi,
>
> I've got some issues/misunderstanding of AVRO schema evolution.
>
> When reading through the Avro documentation, for example [1], I understood that 
> schema evolution is supported, and that if I added a column with a specified default, 
> it should be backwards compatible (and even forward compatible when I remove it again). 
> Sounds great, so I added a column defined as:
>
> {
>   "name": "newColumn",
>   "type": ["null","string"],
>   "default": null,
>   "doc": "something wrong"
> }
>
> and when I try to consume some topic having this schema from the beginning, it fails 
> with this message:
>
> Caused by: java.lang.ArrayIndexOutOfBoundsException: 5
> at 
> org.apache.avro.io.parsing.Symbol$Alternative.getSymbol(Symbol.java:424)
> at org.apache.avro.io.ResolvingDecoder.doAction(ResolvingDecoder.java:290)
> at org.apache.avro.io.parsing.Parser.advance(Parser.java:88)
> at 
> org.apache.avro.io.ResolvingDecoder.readIndex(ResolvingDecoder.java:267)
> at 
> org.apache.avro.generic.GenericDatumReader.readWithoutConversion(GenericDatumReader.java:179)
> at 
> org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:153)
> at 
> org.apache.avro.generic.GenericDatumReader.readField(GenericDatumReader.java:232)
> at 
> org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:222)
> at 
> org.apache.avro.generic.GenericDatumReader.readWithoutConversion(GenericDatumReader.java:175)
> at 
> org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:153)
> at 
> org.apache.avro.generic.GenericDatumReader.readWithoutConversion(GenericDatumReader.java:179)
> at 
> org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:153)
> at 
> org.apache.avro.generic.GenericDatumReader.readField(GenericDatumReader.java:232)
> at 
> org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:222)
> at 
> org.apache.avro.generic.GenericDatumReader.readWithoutConversion(GenericDatumReader.java:175)
> at 
> org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:153)
> at 
> org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:145)
> at 
> tech.allegro.schema.json2avro.converter.JsonAvroConverter.convertToJson(JsonAvroConverter.java:83)
> To give a little bit more information: the Avro schema defines one top-level 
> type with 2 fields, a string describing the type of message and a union of N 
> types. All N-1 non-modified types can be read, but the one updated with an 
> optional, default-having column cannot be read. I'm not sure if this design 
> is strictly speaking correct, but that's not the point (feel free to 
> criticise and recommend a better approach!). I'm after schema evolution, which 
> seems not to be working.
>
>
> And if we alter the type definition to:
>
> "type": "string",
> "default": ""
>
> it still does not work, and the generated error is:
>
> Caused by: org.apache.avro.AvroRuntimeException: Malformed data. Length is 
> negative: -1
> at org.apache.avro.io.BinaryDecoder.doReadBytes(BinaryDecoder.java:336)
> at org.apache.avro.io.BinaryDecoder.readString(BinaryDecoder.java:263)
> at 
> org.apache.avro.io.ResolvingDecoder.readString(ResolvingDecoder.java:201)
> at 
> org.apache.avro.generic.GenericDatumReader.readString(GenericDatumReader.java:422)
> at 
> org.apache.avro.generic.GenericDatumReader.readString(GenericDatumReader.java:414)
> at 
> org.apache.avro.generic.GenericDatumReader.readWithoutConversion(GenericDatumReader.java:181)
> at 
> org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:153)
> at 
> org.apache.avro.generic.GenericDatumReader.readField(GenericDatumReader.java:232)
> at 
> org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:222)
> at 
> org.apache.avro.generic.GenericDatumReader.readWithoutConversion(GenericDatumReader.java:175)
> at 
> org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:153)
> at 
> 

Re: Reg: Avrojob schema validation option.

2019-07-30 Thread Ryan Skraba
Hello!  I'm not sure I understand your question.  Some names are
*required* with a specific format in the Avro specification
(http://avro.apache.org/docs/1.8.2/spec.html#names)

What are you looking to accomplish?  I can think of two scenarios that
we've seen in the past: (1) anonymous records where the name is of no
interest, and (2) mapping a structure that supports arbitrary UTF-8
names (like a database table) to a record with the same field names.
Neither of those is supported in the Avro specification.

For the first case (where we don't care about the record name), we
just autogenerated a "safe" but unused record name.

For the second case, we used a custom annotation on the field
(something like "display.name") to contain the original value and
generated a "safe" field name.

In both cases, being safe means that it meets the Avro spec
([A-Za-z_][A-Za-z0-9_]*) and avoids collisions with other generated
names.
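For what it's worth, a rough sketch of that second approach; the "display.name"
property is just the convention mentioned above, not anything built into Avro,
and collision handling is left out:

import org.apache.avro.Schema;
import org.apache.avro.Schema.Field;

public class SafeNames {
  /** Replace characters the Avro name rules forbid ([A-Za-z_][A-Za-z0-9_]*). */
  static String safeName(String original) {
    String safe = original.replaceAll("[^A-Za-z0-9_]", "_");
    // Names may not start with a digit (or be empty).
    return safe.isEmpty() || Character.isDigit(safe.charAt(0)) ? "_" + safe : safe;
  }

  /** Build a field with a spec-compliant name, keeping the original as a custom property. */
  static Field fieldFor(String originalName, Schema type) {
    Field field = new Field(safeName(originalName), type, null, (Object) null);
    field.addProp("display.name", originalName);
    return field;
  }
}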

I hope this helps!  Ryan

On Fri, Jul 26, 2019 at 1:52 PM SB M  wrote:
>
> Hi All,
>
>  Problem: I need an option to set name validation for schema parsing when 
> setting it up with AvroJob and AvroMultipleInputs.
>
> Is there currently any way to set schema name validation to false? When I 
> go through the source code I am not able to find any option like that.
>
> Please suggest a solution.
>
> Regards,
> Sree.
>


Re: Should a Schema be serializable in Java?

2019-07-18 Thread Ryan Skraba
Hello!  I'm motivated to see this happen :D

+Zoltan, the original author.  I created a PR against apache/avro master
here: https://github.com/apache/avro/pull/589

I cherry-picked the commit from your fork, and reapplied
spotless/checkstyle.  I hope this is the correct way to preserve authorship
and that I'm not stepping on any toes!

Can someone take a look at the above PR?

Best regards,

Ryan

On Tue, Jul 16, 2019 at 11:58 AM Ismaël Mejía  wrote:

> Yes probably it is overkill to warn given the examples you mention.
> Also your argument towards reusing the mature (and battle tested)
> combination of Schema.Parser + String serialization makes sense.
>
> Adding this to 1.9.1 will be an extra selling point for projects
> wanting to migrate to the latest version of Avro so it sounds good to
> me but you should add it to master and then we can cherry pick it from
> there.
>
>
> On Tue, Jul 16, 2019 at 11:16 AM Ryan Skraba  wrote:
> >
> > Hello!  Thanks for the reference to AVRO-1852. It's exactly what I was
> looking for.
> >
> > I agree that Java serialization shouldn't be used for anything
> cross-platform, or (in my opinion) used for any data persistence at all.
> Especially not for an Avro container file or sending binary data through a
> messaging system...
> >
> > But Java serialization is definitely useful and used for sending
> instances of "distributed work" implemented in Java from node to node in a
> cluster.  I'm not too worried about existing connectors -- we can see that
> each framework has "solved" the problem one at a time.  In addition to
> Flink, there's
> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/io/AvroUtils.java#L29
> and
> https://github.com/apache/spark/blob/3663dbe541826949cecf5e1ea205fe35c163d147/external/avro/src/main/scala/org/apache/spark/sql/avro/AvroOutputWriterFactory.scala#L35
> .
> >
> > Specifically, I see the advantage for user-defined distributed functions
> that happen to carry along an Avro Schema -- and I can personally say that
> I've encountered this a lot in our code!
> >
> > That being said, I think it's probably overkill to warn the user about
> the perils of Java serialization (not being cross-language and requiring
> consistent JDKs and libraries across JVMs).  If an error occurs for one of
> those reasons, there's a larger problem for the dev to address, and it's
> just as likely to occur for any Java library in the job if the environment
> is bad.  Related, we've encountered similar issues with logical types
> existing in Avro 1.8 in the driver but not in Avro 1.7 on the cluster...
> the solution is "make sure you don't do that".  (Looking at you, guava and
> jackson!)
> >
> > The patch in question delegates serialization to the string form of the
> schema, so it's basically doing what all of the above Avro "holders" are
> doing -- I wouldn't object to having a sample schema available that fully
> exercises what a schema can hold, but I also think that Schema.Parser (used
> underneath) is currently pretty well tested and mature!
> >
> > Do you think this could be a candidate for 1.9.1 as a minor
> improvement?  I can't think of any reason that this wouldn't be backwards
> compatible.
> >
> > Ryan
> >
> > side note: I wrote java.lang.Serializable earlier, which probably didn't
> help my search for prior discussion... :/
> >
> > On Tue, Jul 16, 2019 at 9:59 AM Ismaël Mejía  wrote:
> >>
> >> This is a good idea even if it may have some issues that we should
> >> probably document and warn users about:
> >>
> >> 1. Java based serialization is really practical for JVM based systems,
> >> but we should probably add a warning or documentation because Java
> >> serialization is not deterministic between JVMs so this could be a
> >> source for issues (usually companies use the same version of the JVM
> >> so this is less critical, but this still can happen especially now with
> >> all the different versions of Java and OpenJDK based flavors).
> >>
> >> 2. This is not cross language compatible, the String based
> >> representation (or even an Avro based representation of Schema) can be
> >> used in every language.
> >>
> >> Even with these I think just for ease of use it is worth to make
> >> Schema Serializable. Is the plan to fully serialize it, or just to
> >> make it a String and serialize the String as done in the issue Doug
> >> mentioned?
> >> If we take the first approach we need to properly test with a Schema
> >> that has elements of the full 

Re: Should a Schema be serializable in Java?

2019-07-16 Thread Ryan Skraba
Hello!  Thanks for the reference to AVRO-1852. It's exactly what I was
looking for.

I agree that Java serialization shouldn't be used for anything
cross-platform, or (in my opinion) used for any *data* persistence at all.
Especially not for an Avro container file or sending binary data through a
messaging system...

But Java serialization is definitely useful and used for sending instances
of "distributed work" implemented in Java from node to node in a cluster.
I'm not too worried about existing connectors -- we can see that each
framework has "solved" the problem one at a time.  In addition to Flink,
there's
https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/io/AvroUtils.java#L29
 and
https://github.com/apache/spark/blob/3663dbe541826949cecf5e1ea205fe35c163d147/external/avro/src/main/scala/org/apache/spark/sql/avro/AvroOutputWriterFactory.scala#L35
.

Specifically, I see the advantage for user-defined distributed functions
that happen to carry along an Avro Schema -- and I can personally say that
I've encountered this a lot in our code!

That being said, I think it's probably overkill to warn the user about the
perils of Java serialization (not being cross-language and requiring
consistent JDKs and libraries across JVMs).  If an error occurs for one of
those reasons, there's a larger problem for the dev to address, and it's
just as likely to occur for any Java library in the job if the environment
is bad.  Related, we've encountered similar issues with logical types
existing in Avro 1.8 in the driver but not in Avro 1.7 on the cluster...
the solution is "make sure you don't do that".  (Looking at you, guava and
jackson!)

The patch in question delegates serialization to the string form of the
schema, so it's basically doing what all of the above Avro "holders" are
doing -- I wouldn't object to having a sample schema available that fully
exercises what a schema can hold, but I also think that Schema.Parser (used
underneath) is currently pretty well tested and mature!
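In the meantime, the workaround we keep rewriting looks roughly like the holder
below; this is only a sketch of the string-delegation idea, not the actual
AVRO-1852 patch:

import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import org.apache.avro.Schema;

/** Wraps a Schema for Java serialization by delegating to its JSON string form. */
public class SerializableSchema implements Serializable {
  private static final long serialVersionUID = 1L;

  private transient Schema schema;

  public SerializableSchema(Schema schema) {
    this.schema = schema;
  }

  public Schema get() {
    return schema;
  }

  private void writeObject(ObjectOutputStream out) throws IOException {
    out.defaultWriteObject();
    out.writeObject(schema.toString()); // the full JSON form of the schema
  }

  private void readObject(ObjectInputStream in) throws IOException, ClassNotFoundException {
    in.defaultReadObject();
    schema = new Schema.Parser().parse((String) in.readObject());
  }
}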

Do you think this could be a candidate for 1.9.1 as a minor improvement?  I
can't think of any reason that this wouldn't be backwards compatible.

Ryan

side note: I wrote java.lang.Serializable earlier, which probably didn't
help my search for prior discussion... :/

On Tue, Jul 16, 2019 at 9:59 AM Ismaël Mejía  wrote:

> This is a good idea even if it may have some issues that we should
> probably document and warn users about:
>
> 1. Java based serialization is really practical for JVM based systems,
> but we should probably add a warning or documentation because Java
> serialization is not deterministic between JVMs so this could be a
> source for issues (usually companies use the same version of the JVM
> so this is less critical, but this still can happen especially now with
> all the different versions of Java and OpenJDK based flavors).
>
> 2. This is not cross language compatible, the String based
> representation (or even an Avro based representation of Schema) can be
> used in every language.
>
> Even with these I think just for ease of use it is worth to make
> Schema Serializable. Is the plan to fully serialize it, or just to
> make it a String and serialize the String as done in the issue Doug
> mentioned?
> If we take the first approach we need to properly test, with a Schema
> that has elements of the full specification, that (de)-serialization
> works correctly. Does anyone know if we already have a test schema
> that covers the full ‘schema’ specification, so that we can reuse it?
>
> On Mon, Jul 15, 2019 at 11:46 PM Driesprong, Fokko 
> wrote:
> >
> > Correct me if I'm wrong here. But as far as I understood the way of
> > serializing the schema is using Avro, as it is part of the file. To avoid
> > confusion there should be one way of serializing.
> >
> > However, I'm not sure if this is worth the hassle of not simply
> > implementing Serializable. Also, in Flink there is a rather far-from-optimal
> > implementation:
> >
> https://github.com/apache/flink/blob/master/flink-formats/flink-parquet/src/main/java/org/apache/flink/formats/parquet/avro/ParquetAvroWriters.java#L72
> > This converts it to JSON and back while distributing the schema to the
> > executors.
> >
> > Cheers, Fokko
> >
> > On Mon, Jul 15, 2019 at 23:03, Doug Cutting  wrote:
> >
> > > I can't think of a reason Schema should not implement Serializable.
> > >
> > > There's actually already an issue & patch for this:
> > >
> > > https://issues.apache.org/jira/browse/AVRO-1852
> > >
> > > Doug
> > >
> > > On Mon, Jul 15, 2019 at 6:49 AM Ismaël Mejía 
> wrote:
> > >
> > > > +d...@avro.apache.org
> > > >
&

Should a Schema be serializable in Java?

2019-07-15 Thread Ryan Skraba
Hello!

I'm looking for any discussion or reference why the Schema object isn't
serializable -- I'm pretty sure this must have already been discussed (but
the keywords +avro +serializable +schema have MANY results in all the
searches I did: JIRA, stack overflow, mailing list, web)

In particular, I was at a demo today where we were asked why Schemas needed
to be passed as strings to run in distributed tasks.  I remember running
into this problem years ago with MapReduce, and again in Spark, and again
in Beam...

Is there any downside to making a Schema implement java.lang.Serializable?
The only thing I can think of is that the schema _should not_ be serialized
with the data, and making it non-serializable loosely enforces this (at the
cost of continually writing different flavours of "Avro holders" for when
you really do want to serialize it).

Willing to create a JIRA and work on the implementation, of course!

All my best, Ryan