Ruby gem fork - contribute back?
Hi,

For a Ruby project, I am using Avro schemas to validate Ruby objects. I ran into some issues with the official avro gem, so I forked it: https://github.com/wvanbergen/tros. (The name probably only makes sense to Dutch people :)

### Changes

- Fixed a round-trip encoding issue for union(double, int) types. Integers were being encoded as floats and read back as floats. In Ruby 2.0 and later, a float == bigint equality check returns false, which caused a test to fail.
- Fixed UTF-8 support for Ruby 1.9+ and JRuby. The original code was written for Ruby 1.8, and there are big changes to how to do this properly in Ruby 1.9+ and JRuby.
- Removed the monkey patching of Enumerable. Monkey patching built-in objects is frowned upon, especially in libraries. Fixing it was easy: https://github.com/wvanbergen/tros/commit/c81d6189277111008ebb05239af91d286dd01061
- Dropped the dependency on yajl-ruby and multi_json. The yajl-ruby dependency was causing compatibility issues with the rest of my application, and there is no released version yet with a working multi_json dependency (1.7.6 cannot be installed because multi_json is misspelled as multi-json). Instead of fixing that, I decided to simply use Ruby's built-in JSON support. For libraries, the fewer external dependencies the better.

I also did some heavy refactoring to make the Ruby codebase work outside the context of the greater Avro project, and applied some best practices from the Ruby ecosystem. Finally, I set up CI (https://travis-ci.org/wvanbergen/tros) that checks the gem on multiple Ruby versions.

### Contributing back?

I would like to contribute my changes back if you are interested. However, maintaining Ruby 1.8 support would make this very hard. Ruby 1.8 doesn't come with built-in JSON support, and its Unicode handling is severely broken. It is also no longer maintained: https://www.ruby-lang.org/en/news/2013/12/17/maintenance-of-1-8-7-and-1-9-2/

Is it acceptable to drop support for Ruby 1.8? If so, I can work with you to get my changes back into the main codebase.

Cheers,
Willem van Bergen
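The union(double, int) round-trip failure can be demonstrated in plain Ruby, independent of the gem; this is a sketch of the failure mode, not avro's encoder:

```ruby
# A large Integer that a double cannot represent exactly: a Float has
# only 53 bits of mantissa, so the low bit is lost in the round trip.
original = 2**64 + 1
round_tripped = original.to_f  # what writing through the double branch yields

# On Ruby 2.0+ the comparison is exact, so the round-tripped value no
# longer equals the original Integer.
puts round_tripped == original  # => false

# A double-representable value survives, which is why the bug only
# showed up for integers outside the 53-bit range.
puts (2**10).to_f == 2**10      # => true
```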
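Replacing yajl-ruby/multi_json with the standard library is a small change at the call sites; a minimal sketch (the schema document below is just an example):

```ruby
require 'json'

# Ruby 1.9+ ships a JSON parser/generator in the standard library, so an
# Avro schema document can be parsed without any external gem.
schema_json = '{"type":"record","name":"User",' \
              '"fields":[{"name":"name","type":"string"}]}'
schema = JSON.parse(schema_json)

puts schema['name']         # => User
puts JSON.generate(schema)  # round-trip back to a JSON string
```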
Re: Ruby gem fork - contribute back?
how far back did you fork? could we have a Ruby 1.8 gem and a Ruby 1.9+ gem? we have python and python 3 support broken out, for example.

-- Sean
Re: Ruby gem fork - contribute back?
I forked off trunk 2 days ago. It's possible to have 2 different gems, but this is not very common in the Ruby world. Because Ruby 1.8 is not maintained anymore, not even for security issues, most people have moved on to newer versions. This is in contrast with Python 2, which is still maintained and heavily used.

My preference would be to document that the last release of avro that supports Ruby 1.8 is 1.7.5. (Version 1.7.6 won't install because of the multi_json issue.) Maintaining 1.8 compatibility will be harder and harder over time and hold back development. E.g. it is already hard to even install a Ruby 1.8 version on a recent OS X due to compiler changes.

Cheers,
Willem
Re: Ruby gem fork - contribute back?
IIRC, the multijson issue is fixed in the current snapshot.

I dunno, I certainly stopped using Ruby 1.8 several years ago. The issue is that Avro has a strong history of favoring compatibility. It would be surprising for us to drop Ruby 1.8 support while still in the Avro 1.7 line. We could plan to only support Ruby 1.9+ in Avro 1.8 and take a contribution that targeted that, maybe?

-- Sean
[jira] [Updated] (AVRO-1532) Field deletion not possible for ReflectData: NPE
[ https://issues.apache.org/jira/browse/AVRO-1532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

O. Reißig updated AVRO-1532:
Attachment: RemovalOfUnionSubtype.java

Thanks for your suggestion with AvroAlias, that's a lot better than my first try :-) Explicitly stating the schema does indeed resolve the issue with removed fields, but now I run into problems with union types. In my schema I have a Map containing an abstract type that has several possible implementations via @Union. I'd like the rest of the serialized data to still be readable when removing one of the Union subtypes. I attached a test case to illustrate my example. In the real world, accessing the removed type would yield a ClassCastException, but of course I cannot actually remove a class in this test case. As the data is stored in a Map, I should still be able to access the rest of the map entries. Did I make my use case clear? Is this realistic?

Why actually store the schema inside the serialized file if it is overridden anyway? I thought it's better to parse the serialized data with its corresponding schema.

Field deletion not possible for ReflectData: NPE
Key: AVRO-1532
URL: https://issues.apache.org/jira/browse/AVRO-1532
Project: Avro
Issue Type: Bug
Components: java
Affects Versions: 1.7.6
Reporter: O. Reißig
Labels: java, reflection
Attachments: AVRO-1532.patch, ReflectDataFieldRemovalTest.java, ReflectDataFieldRemovalTest.java, RemovalOfUnionSubtype.java

*Actual behaviour:* I have a field in my reflection-based schema like this:

{code}
@Nullable
@AvroDefault(null)
public Long someField;
{code}

When removing this field, parsing the previously serialized blob yields a NullPointerException:

{noformat}
java.lang.NullPointerException
	at org.apache.avro.reflect.ReflectData.setField(ReflectData.java:128)
	at org.apache.avro.generic.GenericDatumReader.readField(GenericDatumReader.java:193)
	at org.apache.avro.reflect.ReflectDatumReader.readField(ReflectDatumReader.java:230)
	at org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:183)
	at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:151)
	at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:142)
	at org.apache.avro.file.DataFileStream.next(DataFileStream.java:233)
	at org.apache.avro.file.DataFileStream.next(DataFileStream.java:220)
	at ReflectDataFieldRemovalTest.testFieldRemoval(ReflectDataFieldRemovalTest.java:41)
{noformat}

*Expected behaviour:* Field removal is crucial for schema evolution and must be possible with ReflectData.

--
This message was sent by Atlassian JIRA
(v6.2#6252)
Re: Ruby gem fork - contribute back?
Dropping support for Ruby 1.8 in the Avro 1.8.x series sounds like a plan. Is there already a branch for the 1.8 series? Until that happens, I can maintain my fork for people requiring proper Unicode/UTF-8 support on Ruby 1.9+.

I know the multi_json issue is fixed in trunk. However, due to the project's structure, it's very hard to use a non-released version inside a project. Because the project doesn't include a gemspec file, you cannot make Bundler use the latest trunk version. (In my case, I use avro inside another gem. A gem can only depend on released versions of other gems, so I had to fork and release it.)

Willem
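For reference, this is the Bundler feature that the missing gemspec blocks. A hedged Gemfile sketch, assuming the Ruby bindings would live under lang/ruby (the path and the gemspec location are assumptions, not the current repository layout):

```ruby
# Gemfile sketch: Bundler can only consume a gem from a git checkout if
# the repository ships a .gemspec file it can locate. Without one, a
# declaration like this fails to resolve, so projects are stuck with
# released versions from rubygems.org.
source 'https://rubygems.org'

# Hypothetical: works only once the avro repo includes a gemspec for the
# Ruby bindings (the :glob option tells Bundler where to look for it).
gem 'avro', git: 'https://github.com/apache/avro.git',
            glob: 'lang/ruby/*.gemspec'
```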
[jira] [Commented] (AVRO-1124) RESTful service for holding schemas
[ https://issues.apache.org/jira/browse/AVRO-1124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14043745#comment-14043745 ]

Thunder Stumpges commented on AVRO-1124:

Hi Francois, I noticed you applied these patches, and the code tweaks from Felix, in mate1/release-1.7.5-with-AVRO-1124. I was able to pull and build it perfectly. Thanks a lot! For others, here's the branch from trunk with the patches for both AVRO-1315 and AVRO-1124 applied: https://github.com/mate1/avro/tree/from-98ec5f2a172391cb5dfa7b4d85f39065bae22754-with-AVRO-1315-and-AVRO-1124

BTW, are we expecting AVRO-1124 to end up in an 'official' Avro release, or will this permanently be an open issue with a set of patches? I don't necessarily mind it this way; we have figured it out and gotten it working. I'm just curious what the end plan for it is. I'd be glad to help out in whatever way I can. Thanks again everyone.

Thunder

RESTful service for holding schemas
---
Key: AVRO-1124
URL: https://issues.apache.org/jira/browse/AVRO-1124
Project: Avro
Issue Type: New Feature
Reporter: Jay Kreps
Assignee: Jay Kreps
Attachments: AVRO-1124-can-read-with.patch, AVRO-1124-draft.patch, AVRO-1124-validators-preliminary.patch, AVRO-1124.patch, AVRO-1124.patch

Motivation: It is nice to be able to pass around data in serialized form but still know the exact schema that was used to serialize it. The overhead of storing the schema with each record is too high unless the individual records are very large. There are workarounds for some common cases: in the case of files, a schema can be stored once with a file of many records, amortizing the per-record cost; and in the case of RPC, the schema can be negotiated ahead of time and used for many requests. For other uses, though, it is nice to be able to pass a reference to a given schema using a small id and allow this to be looked up. Since only a small number of schemas are likely to be active for a given data source, these can easily be cached, so the number of remote lookups is very small (one per active schema version).

Basically this would consist of two things:
1. A simple REST service that stores and retrieves schemas
2. Some helper Java code for fetching and caching schemas for people using the registry

We have used something like this at LinkedIn for a few years now, and it would be nice to standardize this facility to be able to build up common tooling around it. This proposal will be based on what we have, but we can change it as ideas come up. The facilities this provides are super simple: basically you can register a schema, which gives back a unique id for it, or you can query for a schema. There is almost no code, and nothing very complex. The contract is that before emitting/storing a record you must first publish its schema to the registry, or know that it has already been published (by checking your cache of published schemas). When reading, you check your cache, and if you don't find the id/schema pair there, you query the registry to look it up. I will explain some of the nuances in more detail below.

An added benefit of such a repository is that it makes a few other things possible:
1. A graphical browser of the various data types that are currently used and all their previous forms.
2. Automatic enforcement of compatibility rules. Data is always compatible in the sense that the reader will always deserialize it (since they are using the same schema as the writer), but this does not mean it is compatible with the expectations of the reader. For example, if an int field is changed to a string, that will almost certainly break anyone relying on that field. This definition of compatibility can differ for different use cases and should likely be pluggable.

Here is a description of one of our uses of this facility at LinkedIn. We use this to retain a schema with log data end-to-end, from the producing app to various real-time consumers as well as a set of resulting Avro files in Hadoop. This schema metadata can then be used to auto-create hive tables (or add new fields to existing tables), or to infer pig fields, all without manual intervention. One important definition of compatibility that is nice to enforce is compatibility with historical data for a given table. Log data is usually loaded in an append-only manner, so if someone changes an int field in a particular data set to be a string, tools like pig or hive that expect static columns will be unusable. Even using plain-vanilla map/reduce, processing data where columns and types change willy-nilly is painful. However, the person emitting this kind of data may not know all the details of compatible schema evolution. We use the schema repository to
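The read-path contract described in this issue, check a local cache first and query the registry only on a miss, can be sketched as a small client. Everything below (class name, endpoint path, response shape) is invented for illustration and is not the API from the AVRO-1124 patches:

```ruby
require 'json'
require 'net/http'

# Hedged sketch of a caching registry client: schemas are immutable once
# registered, so each id needs at most one remote lookup per process.
class CachingSchemaClient
  def initialize(base_url)
    @base_url = base_url
    @cache = {}  # schema id => parsed schema document
  end

  # Return the schema for an id, hitting the registry only on a cache miss.
  def schema_for(id)
    @cache[id] ||= fetch(id)
  end

  private

  # Hypothetical endpoint layout; a real registry defines its own routes.
  def fetch(id)
    uri = URI("#{@base_url}/schemas/#{id}")
    JSON.parse(Net::HTTP.get(uri))
  end
end
```

The write-side contract is symmetrical: publish the schema (or confirm it is already in your cache of published schemas) before emitting any record that references its id.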
[jira] [Commented] (AVRO-1124) RESTful service for holding schemas
[ https://issues.apache.org/jira/browse/AVRO-1124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14043777#comment-14043777 ]

Thunder Stumpges commented on AVRO-1124:

Hi Felix, I was just reading through the comments and saw this from you. We are going through the exact same thing right now as well.

{quote}
There are certain avro schemas that we use across many Kafka topics (1:M relationship between schema and topics). I would like to benefit from the facilitated evolution capabilities of the schema repo, but I'm not 100% sure of the best way to proceed. I would like to avoid:
1. Having to register the same schema (and each further schema evolution) into many subjects.
2. Having to externally manage a mapping of Kafka topic = subject registered into the repo.
{quote}

We also are trying to avoid this. We are currently planning to take a topic naming convention approach where we combine the subject name (the avro class FQN) with a topic suffix when naming our topic. So a topic would be named 'subject_name--topic_suffix', where we don't use the -- delimiter in either the subject name or the suffix. I think this avoids both of the above issues, as well as not requiring each message to carry a subject ID. It does, however, add complexity to all consumers, which need to know how to parse topic names.

{quote}
Another possibility would be to introduce the concept of a SubjectAlias. The way it would work is that you would register a SubjectAlias with an aliasName and a targetName. If the aliasName already exists, or if the targetName does not exist, the operation would fail. Afterwards, any lookup for the aliasName would return a DelegatingSubject containing the Subject referenced by the targetName of the alias that was looked up. This change seems clean and not too intrusive, and also wouldn't require encoding both subject ID and schema ID in my message payloads. But perhaps there are problems with this approach that I haven't thought of. Do you think this approach makes sense? And would it be worth contributing back into the main schema repo code?
{quote}

I think this would be a fine approach, and it would simplify our Kafka consumers to not need to understand the convention we came up with. It would also free you to use any convenient topic name for any subject schema, without having to adhere to that naming convention. Have you had any progress or other thoughts on this issue since January? I realize I'm a little late to the party :)

Cheers,
Thunder
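The 'subject_name--topic_suffix' convention discussed above is easy to make concrete; a short Ruby sketch (the helper name is invented for illustration):

```ruby
# Split a Kafka topic named under the 'subject--suffix' convention, which
# works because '--' is guaranteed not to occur in either part.
def split_topic(topic)
  subject, suffix = topic.split('--', 2)
  { subject: subject, suffix: suffix }
end

split_topic('com.example.avro.PageView--raw')
# => { subject: 'com.example.avro.PageView', suffix: 'raw' }

# A topic without a suffix still parses; suffix is simply nil.
split_topic('com.example.avro.PageView')
# => { subject: 'com.example.avro.PageView', suffix: nil }
```

The subject half is then the key used against the schema repo, so no external topic-to-subject mapping needs to be maintained.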
[jira] [Commented] (AVRO-1124) RESTful service for holding schemas
[ https://issues.apache.org/jira/browse/AVRO-1124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14043801#comment-14043801 ]

Felix GV commented on AVRO-1124:

Hi Thunder,

Disclaimer: I don't work for Mate1 anymore, so what I'm going to say might be out of date by now. The Mate1 guys will need to chime in for the latest state of their work on this...

I did not end up coding SubjectAliases into Avro proper because there didn't seem to be any interest from the OSS community, and I had limited time to build this completely generically.

Mate1 had a strategy that is kind of similar to the one you're describing. We had arbitrary topic names, which could be dynamically appended with certain suffixes (like __PROCESSING_FAILURE, __DECODING_FAILURE or whatever). Those dynamically created topics could then be re-processed later on, as convenient, and would contain the same schema as the topic they derived from. In our Camus decoders, we hard-coded some topic alias resolution right there in the code (since the number of suffixes was limited in our case). There were talks of porting that topic alias resolution logic to our schema-repo-client implementation (https://github.com/mate1/schema-repo-client/), so that it is more conveniently available to all Kafka consumers (not just Camus), but that didn't end up happening before I left.

So essentially, we went for option 2 above. For an organization with more numerous or diverse suffixes, that strategy would probably not be ideal, but for a small number of suffixes (or prefixes), it was deemed acceptable.

Hopefully, that sheds some light (:

-F
Re: Ruby gem fork - contribute back?
There isn't a branch for 1.8. Patches that target that version just get generated based on trunk and attached to tickets with a fix version of 1.8.0. Generally, they also get the incompatible flag. Sure, I agree that using the unreleased versions isn't tenable. Doug made a call for a 1.7.7 release back at the end of May[1]. It would be good to ping that thread with the importance of getting something out soon for the Ruby folks. [1]: http://s.apache.org/1LB On Wed, Jun 25, 2014 at 4:54 AM, Willem van Bergen wil...@vanbergen.org wrote: Dropping support for Ruby 1.8 in the Avro 1.8.x series sounds like a plan. Is there already a branch for the 1.8 series? Until then, I can maintain my fork for people requiring UTF-8 support on Ruby 1.9+. I know the multi_json issue is fixed in trunk. However, due to the project's structure, it's very hard to use a non-released version inside a project. Because the project doesn't include a gemspec file, you cannot make Bundler use the latest trunk version. (In my case, I use avro inside of another gem. Gems can only depend on released versions of other gems, so I had to fork and release it.) Willem On Jun 25, 2014, at 5:33 AM, Sean Busbey bus...@cloudera.com wrote: IIRC, the multi_json issue is fixed in the current snapshot. I dunno, I certainly stopped using Ruby 1.8 several years ago. The issue is that Avro has a strong history of favoring compatibility. It would be surprising for us to drop Ruby 1.8 support while still in the Avro 1.7 line. We could plan to only support Ruby 1.9+ in Avro 1.8 and take a contribution that targeted that, maybe? -- Sean On Jun 25, 2014 4:16 AM, Willem van Bergen wil...@vanbergen.org wrote: I forked off trunk 2 days ago. It's possible to have 2 different gems, but this is not very common in the Ruby world. Because Ruby 1.8 is not maintained anymore, not even for security issues, most people have moved on to newer versions.
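The gemspec problem Willem describes is visible from Bundler's side: a Gemfile can point at a git repository, but Bundler will only resolve the dependency if that repository carries a .gemspec. A hypothetical Gemfile fragment showing what a gemspec in the Avro repo would enable (this does not work against the trunk of that era, which is the complaint):

```ruby
# Gemfile sketch (hypothetical): resolves only if the referenced
# repository contains a gemspec, which avro trunk did not at the time.
gem 'avro', git: 'https://github.com/apache/avro.git', branch: 'trunk'
```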
This is in contrast with Python 2, which is still maintained and heavily used. My preference would be to document that the last release of avro that supports Ruby 1.8 is 1.7.5. (Version 1.7.6 won't install because of the multi_json issue.) Maintaining 1.8 compatibility will become harder and harder over time and hold back development. E.g. it is already hard to even install Ruby 1.8 on a recent OS X due to compiler changes. Cheers, Willem On Jun 25, 2014, at 5:06 AM, Sean Busbey bus...@cloudera.com wrote: how far back did you fork? could we have a Ruby 1.8 gem and a Ruby 1.9+ gem? we have python and python 3 support broken out, for example. On Wed, Jun 25, 2014 at 3:51 AM, Willem van Bergen wil...@vanbergen.org wrote: Hi, For a Ruby project, I am using Avro schemas to validate Ruby objects. I ran into some issues with the official avro gem, so I forked it: https://github.com/wvanbergen/tros. (The name probably only makes sense to Dutch people :) ### Changes - Fixed a round-trip encoding issue for union(double, int) types. Integers were being encoded as floats and read back as floats. In Ruby 2.0 and later, a Float == Bignum equality check is exact, so it returns false when precision was lost. This caused a test to fail. - Fixed UTF-8 support for Ruby 1.9+ and JRuby. The original code was written for Ruby 1.8, and there are significant differences in how to do this properly on Ruby 1.9+ and JRuby. - Removed monkey patching of Enumerable. Monkey patching built-in objects is frowned upon, especially in libraries. Fixing it was easy: https://github.com/wvanbergen/tros/commit/c81d6189277111008ebb05239af91d286dd01061 - Dropped the dependencies on yajl-ruby and multi_json. The yajl-ruby dependency was causing compatibility issues with the rest of my application, and there's no released version yet with a working multi_json dependency (1.7.6 cannot be installed because multi_json is misspelled as multi-json). Instead of fixing that, I decided to simply use Ruby's built-in JSON support.
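The union(double, int) round-trip bug above comes down to IEEE-754 precision: a double has a 53-bit mantissa, so large integers are not all representable, and Ruby 1.9+/2.0+ compares Float against Integer exactly rather than converting the integer to a float first. A small illustration (not code from the gem):

```ruby
# Integers beyond 2**53 cannot all be represented exactly as a double,
# so writing one through the double branch of a union loses data.
big = 2**62 + 1
as_double = big.to_f          # rounds to the nearest representable double
puts as_double == big         # false: modern Ruby compares exactly
puts (2**62).to_f == 2**62    # true: a power of two is representable
```

On Ruby 1.8 the first comparison could come out true, because the Bignum was itself converted to a Float before comparing, hiding the precision loss that the test caught.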
For libraries, the fewer external dependencies the better. I also did some heavy refactoring to make the Ruby codebase work outside of the context of the greater Avro project, and applied some best practices from the Ruby ecosystem. Finally, I set up CI (https://travis-ci.org/wvanbergen/tros) that checks the gem on multiple Ruby versions. ### Contributing back? I would like to contribute my changes back if you are interested. However, maintaining Ruby 1.8 support would make this very hard. Ruby 1.8 doesn't come with built-in JSON support, and its Unicode handling is severely broken. It is also no longer maintained: https://www.ruby-lang.org/en/news/2013/12/17/maintenance-of-1-8-7-and-1-9-2/ Is it acceptable to drop support for Ruby 1.8? If so, I can work with you to get my changes back into the main codebase. Cheers, Willem van
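The Unicode work mentioned above boils down to Ruby 1.9's per-string encodings: bytes read off the wire arrive as binary, and Avro strings are defined to be UTF-8, so a decoder has to tag them. A sketch of that step (an assumption for illustration, not the gem's actual code):

```ruby
# On Ruby 1.9+ every String carries an encoding. Ruby 1.8 strings were
# plain byte arrays, which is why the 1.8-era code had no such step.
raw = "caf\xC3\xA9".dup.force_encoding(Encoding::BINARY) # bytes off the wire
decoded = raw.force_encoding(Encoding::UTF_8)            # Avro strings are UTF-8
puts decoded.valid_encoding?  # true
puts decoded                  # café
```

force_encoding only retags the bytes without transcoding them, which is the right operation here since the wire bytes are already UTF-8; valid_encoding? is the cheap sanity check for corrupt input.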
Re: Ruby gem fork - contribute back?
On Wed, Jun 25, 2014 at 2:16 AM, Willem van Bergen wil...@vanbergen.org wrote: It's possible to have 2 different gems, but this is not very common in the Ruby world. Because Ruby 1.8 is not maintained anymore, not even for security issues, most people have moved on to newer versions. I can see a couple of options: 1. Assume that no one actually uses Ruby 1.8 anymore, and upgrade to 1.9 in Avro 1.7.7. A change that doesn't break anyone isn't incompatible. 2. Assume some folks still use Ruby 1.8 and add a ruby1.9 fork in Avro 1.7.7. Ruby users who upgrade to Avro 1.7.7 would need to opt-in to the Ruby 1.9 version. 3. Wait until we release 1.8.0 to upgrade Avro to support Ruby 1.9. (3) seems like a bad option unless we're confident we're going to release a 1.8.0 soon, which I am not. Folks hate getting broken by upgrades. Avro is a dependency of a lot of Java applications. Having an incompatible release makes it hard for one component to upgrade without forcing all to upgrade. Either you end up with a broken stack or with one that can never upgrade. Which of (1) or (2) seems more palatable to Ruby folks? Are there other options? Doug
[jira] [Commented] (AVRO-1124) RESTful service for holding schemas
[ https://issues.apache.org/jira/browse/AVRO-1124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14044190#comment-14044190 ] Doug Cutting commented on AVRO-1124: BTW, are we expecting AVRO-1124 to end up in an 'official' Avro release or will this constantly be an open issue with a set of patches? I'd love to see this in a release sooner rather than later. Since it's new functionality there should be no compatibility issues. All it takes is someone to declare a particular patch ready to be committed, and for one or more other folks to endorse that. We also need to have some confidence that, even if it's incomplete, the public APIs it exposes can be supported compatibly as functionality is improved. RESTful service for holding schemas --- Key: AVRO-1124 URL: https://issues.apache.org/jira/browse/AVRO-1124 Project: Avro Issue Type: New Feature Reporter: Jay Kreps Assignee: Jay Kreps One important definition of compatibility that is nice to enforce is compatibility with historical data for a given table. Log data is usually loaded in an append-only manner, so if someone changes an int field in a particular data set to be a string, tools like pig or hive that expect static columns will be unusable. Even using plain-vanilla map/reduce, processing data where columns and types change willy-nilly is painful. However the person emitting this kind of data may not know all the details of compatible schema evolution. We use the schema repository to validate that any change made to a schema doesn't violate the compatibility model, and reject the update if it does. We do this check both at run time, and also as part of the ant task
[jira] [Commented] (AVRO-1530) Java DataFileStream does not allow distinguishing between empty files and corrupt files
[ https://issues.apache.org/jira/browse/AVRO-1530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14044193#comment-14044193 ] Doug Cutting commented on AVRO-1530: That would be an incompatible change. Some folks might rely on the current behaviour. One can detect an empty file by looking at its length. No valid Avro data file will ever be empty. Java DataFileStream does not allow distinguishing between empty files and corrupt files --- Key: AVRO-1530 URL: https://issues.apache.org/jira/browse/AVRO-1530 Project: Avro Issue Type: Bug Reporter: Brock Noland When writing data to HDFS, especially with Flume, it's possible to write empty files. When you run Hive queries over this data, the job fails with "Not a data file." raised from here: https://github.com/apache/avro/blob/trunk/lang/java/avro/src/main/java/org/apache/avro/file/DataFileStream.java#L102 -- This message was sent by Atlassian JIRA (v6.2#6252)
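Doug's suggested workaround above is easy to apply on the caller's side before handing a file to the Avro reader. The helper below is illustrative, not part of Avro, and modeled in Ruby for consistency with this thread:

```ruby
# No valid Avro data file is ever zero bytes (it carries at least the
# magic bytes and header), so a size check cleanly separates "empty"
# from "corrupt" before the reader ever sees the file.
def empty_avro_file?(path)
  File.size(path).zero?
end
```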
[jira] [Updated] (AVRO-695) Cycle Reference Support
[ https://issues.apache.org/jira/browse/AVRO-695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sachin Goyal updated AVRO-695: -- Attachment: circular_refs_and_nonstring_map_keys_2014_06_25.zip Cycle Reference Support --- Key: AVRO-695 URL: https://issues.apache.org/jira/browse/AVRO-695 Project: Avro Issue Type: New Feature Components: spec Affects Versions: 1.7.6 Reporter: Moustapha Cherri Attachments: avro-1.4.1-cycle.patch.gz, avro-1.4.1-cycle.patch.gz, avro_circular_references.zip, avro_circular_refs_2014_06_14.zip, circular_refs_and_nonstring_map_keys_2014_06_25.zip Original Estimate: 672h Remaining Estimate: 672h This is a proposed implementation to add cycle reference support to Avro. It basically introduces a new type named Cycle. A cycle contains a string representing the path to the other reference. For example, suppose we have an object of type Message that has a member named previous, also of type Message, and this hierarchy: message previous : message2 message2 previous : message2 When serializing, the cycle path for message2.previous will be previous. The implementation depends on ANTLR to evaluate those cycles at read time to resolve them. I used ANTLR 3.2. This dependency is not mandated; I just used ANTLR to speed things up. I kept the generated ANTLR code in this implementation, though it should instead be generated during the build. I only updated the Java code. I did not do full unit testing, but you can find the avrotest.Main class that can be used as a preliminary test. Please do not hesitate to contact me for further clarification if this seems interesting. Best regards, Moustapha Cherri
[jira] [Commented] (AVRO-695) Cycle Reference Support
[ https://issues.apache.org/jira/browse/AVRO-695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14044266#comment-14044266 ] Sachin Goyal commented on AVRO-695: --- h3.Circular References *Serialization* Extra API required (not optional): {code}ReflectData.setCircularRefIdPrefix(some-field-name){code} If the above is set, following happens: # During serialization, each record contains the extra field specified above. The value for this field is just a monotonically increasing number meant to uniquely identify each record in one particular serialization. # While writing schema, each RECORD schema is converted into a UNION schema such that it can either be a record or a string. During object serialization, if a record is seen before, it is not written as a record. Rather it is written as a string in the format: some-field-name + ID-generated-in #1 above. With this structure, reading applications have enough information to restore the circular reference if they want. This structure is also usable by languages not supporting circular reference because they will read that circular-reference as a normal string. (AllowNull also works with this). # Above field-name is included in each record as a property. This allows the readers to become aware of this field-name so that the clients do not have to specify this just to populate the circular references. Basically it makes the schema self-sufficient. *Deserialization* Extra API required (optional): {code}GenericDatumReader.setResolveCircularRefs(boolean){code} Based on #3 above, GenericDatumReader becomes circular-reference-aware. But since all GenericDatumReaders share a common GenericData instance, they are provided with another flag resolveCircularRefs to control whether they want to resolve circular references or not. 
If this flag is set and the serialized schema has a non-null value for the circular-reference field, GenericDatumReader does the following: # If any record has the circular-ref field, store its value and the corresponding record in a map. # Look for unions which can be serialized as a record as well as a string. On finding such a record serialized as a string, replace the string with the record retrieved from the map created in #1 h3.Non-string map-keys *Serialization* No extra API required. Without this patch, Avro throws an exception for non-string map-keys. This patch converts such maps into an array of records where each record has two fields: key and value. Example: Map<ObjX, ObjY> is converted to [{key:{ObjX}, value:{ObjY}}] To do this, the following is done: # In ReflectData.java, create schemas for the key as well as the value of the non-string hash-map. Encapsulate these two schemas into a record schema and create an array schema of such records. Set property NS_MAP_FLAG to 1 and store the actual class of the map as a CLASS_PROP # While writing out a non-string map field, if NS_MAP_FLAG is set, convert the map to an array of records using map.entrySet() *Deserialization* No extra API required. Deserialization for non-string map-keys is pretty simple since the data and the schema match exactly, so it just deserializes automatically. To create an actual map (like when using ReflectDatumReader with an actual-class type-parameter), the map is instantiated using CLASS_PROP if the property NS_MAP_FLAG is set to 1 h3.Testcases included The unit tests cover the following: # Circular references at multiple levels of hierarchy # Circular references within Collections and Maps. # Circular and non-circular deserialization of circularly serialized objects. # Non-string map-keys having circular references. # Non-string map-keys with nested maps.
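As a toy model of the circular-reference scheme in the comment above (the "ref-" prefix and field names here are made up for illustration, not the patch's actual Java API): the first visit to a record assigns it an id, and any later visit emits a string carrying that id instead of recursing, which is exactly what a reader in a language without reference support would see.

```ruby
# Walk a Struct graph into a plain tree; an already-seen node is
# replaced by the string "ref-<id>", breaking the cycle.
def to_tree(node, seen = {}, counter = [0])
  return "ref-#{seen[node.object_id]}" if seen.key?(node.object_id)
  seen[node.object_id] = (counter[0] += 1)
  pairs = node.each_pair.map do |name, value|
    [name.to_s, value.is_a?(Struct) ? to_tree(value, seen, counter) : value]
  end
  { 'id' => seen[node.object_id] }.merge(pairs.to_h)
end
```

A self-referencing node thus serializes with its back edge as a string, and a cycle-aware reader can rebuild the reference from the id while a naive reader just keeps the string.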
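The non-string map-key conversion described in the AVRO-695 comment above is easy to model (the key/value field names come from the comment; the helper names are illustrative): a Hash with arbitrary keys becomes an array of two-field records, and the read side reverses it.

```ruby
# Avro map keys must be strings, so a Hash with non-string keys is
# rewritten as an array of {key, value} records, and back again.
def map_to_records(hash)
  hash.map { |k, v| { 'key' => k, 'value' => v } }
end

def records_to_map(records)
  records.map { |r| [r['key'], r['value']] }.to_h
end
```

The round trip is lossless as long as the original keys were distinct, which a Hash guarantees by construction.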
Re: Ruby gem fork - contribute back?
Personally, I'd rather see #2. I think it's very hard to know what the current use of Ruby 1.8 is. Support from the MRI community only ended ~1 year ago[1]. JRuby still supports running in 1.8 mode. They'll be dropping it in their next major release, but there isn't a schedule for that yet, and they expect the current major release line to continue for some time after that[2]. Additionally, Heroku won't be ending support until the end of this month[3]. Even after that, it's not clear to me that they won't allow users to keep using it. As mentioned previously, I'm a JRuby-in-1.9-mode user and I usually just work with the Java libraries directly. So this won't directly impact me, but I agree that it sucks when upgrades break things. So I don't feel like #1 is an option. We could also investigate maintaining a single gem that just had two implementations within it, with the active one determined by the Ruby version. [1]: https://www.ruby-lang.org/en/news/2013/06/30/we-retire-1-8-7/ [2]: https://groups.google.com/d/msg/jruby-users/qmLpZ7qDwZo/J_iYViplcq4J [3]: https://devcenter.heroku.com/articles/ruby-support#ruby-versions -Sean On Wed, Jun 25, 2014 at 6:41 PM, Doug Cutting cutt...@apache.org wrote: On Wed, Jun 25, 2014 at 2:16 AM, Willem van Bergen wil...@vanbergen.org wrote: It's possible to have 2 different gems, but this is not very common in the Ruby world. Because Ruby 1.8 is not maintained anymore, not even for security issues, most people have moved on to newer versions. I can see a couple of options: 1. Assume that no one actually uses Ruby 1.8 anymore, and upgrade to 1.9 in Avro 1.7.7. A change that doesn't break anyone isn't incompatible. 2. Assume some folks still use Ruby 1.8 and add a ruby1.9 fork in Avro 1.7.7. Ruby users who upgrade to Avro 1.7.7 would need to opt-in to the Ruby 1.9 version. 3. Wait until we release 1.8.0 to upgrade Avro to support Ruby 1.9.
(3) seems like a bad option unless we're confident we're going to release a 1.8.0 soon, which I am not. Folks hate getting broken by upgrades. Avro is a dependency of a lot of Java applications. Having an incompatible release makes it hard for one component to upgrade without forcing all to upgrade. Either you end up with a broken stack or with one that can never upgrade. Which of (1) or (2) seems more palatable to Ruby folks? Are there other options? Doug -- Sean
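Sean's single-gem idea above could be as simple as a load-time switch on the running Ruby's version; the implementation paths below are hypothetical, just to show the shape of the dispatch:

```ruby
# Pick an implementation path for the running Ruby; a gem's top-level
# file could `require` the returned path at load time.
def avro_impl_for(ruby_version)
  if Gem::Version.new(ruby_version) < Gem::Version.new('1.9')
    'avro/ruby18'  # hypothetical 1.8-compatible code path
  else
    'avro/modern'  # hypothetical 1.9+ code path
  end
end
```

Using Gem::Version rather than string comparison avoids the classic trap where '1.10' sorts before '1.9' lexicographically.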