[jira] [Commented] (AVRO-2247) Improve Java reading performance with a new reader
[ https://issues.apache.org/jira/browse/AVRO-2247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17027873#comment-17027873 ] Raymie Stata commented on AVRO-2247: I'm in favor. > Improve Java reading performance with a new reader > -- > > Key: AVRO-2247 > URL: https://issues.apache.org/jira/browse/AVRO-2247 > Project: Apache Avro > Issue Type: Improvement > Components: java >Reporter: Martin Jubelgas >Priority: Major > Fix For: 1.10.0 > > Attachments: Perf-Comparison.md > > > Complementary to AVRO-2090, I have been working on decoding of Avro objects > in Java and am suggesting a new implementation of a DatumReader that improves > read performance for both generic and specific records by approximately 20% > (and even more in cases of nested objects with defaults, a case I encounter a > lot in practical use). > The key concept is to create a detailed execution plan once, at DatumReader > construction time. This execution plan contains all required defaulting/lookup values so they need > not be looked up during object traversal while reading. > The reader implementation can be enabled and disabled per GenericData > instance. The system default is set via the system property > "org.apache.avro.fastread" (defaults to "false"). > Attached is a performance comparison of the existing implementation with the > proposed one. Will open a pull request with respective code in a bit (not > including interoperability with the optimizations of AVRO-2090 yet). Please > let me know your opinion on whether this is worth pursuing further. > -- This message was sent by Atlassian Jira (v8.3.4#803005)
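The execution-plan idea described in the issue can be sketched in plain Java. This is a hypothetical illustration of the concept only, not the proposed DatumReader implementation; all names here (FastReadSketch, FieldStep, readField, useDefault) are invented for the example.

```java
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the "execution plan" idea: resolve each reader
// field to an action (read the writer's value, or use a default) ONCE,
// then replay the prebuilt plan for every record.
class FastReadSketch {
    // One precomputed step per reader field.
    interface FieldStep {
        Object apply(Map<String, ?> writerRecord);
    }

    // Field present in the writer's data: read it through.
    static FieldStep readField(String name) {
        return rec -> rec.get(name);
    }

    // Field missing from the writer: the default is resolved at plan-build
    // time, so no lookup happens during record traversal.
    static FieldStep useDefault(Object defaultValue) {
        return rec -> defaultValue;
    }

    // The "reader": executes the plan with no per-record schema lookups.
    static Object[] read(List<FieldStep> plan, Map<String, ?> writerRecord) {
        Object[] out = new Object[plan.size()];
        for (int i = 0; i < plan.size(); i++) {
            out[i] = plan.get(i).apply(writerRecord);
        }
        return out;
    }
}
```

Because defaults are captured when the plan is built, the per-record loop does no schema or default lookups at all, which is where the extra speedup claimed for default-heavy nested records would come from.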
[jira] [Commented] (AVRO-2400) Avro 1.9.0 can't resolve schemas that can be resolved in 1.8.2
[ https://issues.apache.org/jira/browse/AVRO-2400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16850557#comment-16850557 ] Raymie Stata commented on AVRO-2400: I don't have a horse in this race, other than getting the definition of schema resolution and the schema-compatibility check to agree. Whether 1.9.x goes with the stricter (SchemaCompatibility) definition or the looser (schema resolution) approach doesn't much matter to me. Happy to let the larger community decide. That said, it doesn't seem like the larger community is likely to weigh in on this topic. As a result, I think we're obligated to go with what we believe most people in the community are assuming. From this perspective, it seems like many more people are dependent on what the Java schema-resolution logic in 1.8.x is doing, which would make it the de facto standard (any evidence to the contrary?). Thus, I'd lean towards bringing the 1.9.x behavior of schema-resolution and SchemaCompatibility in line with the 1.8.x behavior of schema-resolution. But definitely open to counter arguments! > Avro 1.9.0 can't resolve schemas that can be resolved in 1.8.2 > -- > > Key: AVRO-2400 > URL: https://issues.apache.org/jira/browse/AVRO-2400 > Project: Apache Avro > Issue Type: Bug > Components: java >Reporter: Jacob Tolar >Priority: Blocker > Fix For: 1.10.0, 1.9.1 > > > The failure occurs in ResolvingGrammarGenerator when reader and writer schema > have an array of records with different full names (e.g. different > namespace). > {code:java} > Exception in thread "main" java.lang.ClassCastException: > org.apache.avro.Resolver$ReaderUnion cannot be cast to > org.apache.avro.Resolver$Container{code} > Avro 1.8.2 allowed this behavior but it now fails in 1.9.0. Looking at the > jiras and code, I don't believe this was intentional ( AVRO-2275, > [https://github.com/apache/avro/commit/39d959e1c6a1f339f03dab18289e47f27c10be7f] > ). 
> > It looks like there were some attempts to keep compatibility ( > [https://github.com/apache/avro/blob/branch-1.9/lang/java/avro/src/main/java/org/apache/avro/Resolver.java] > , e.g. see the commented out check for w.getFullName() in resolve()) but > this case was missed. > > See this simple example to reproduce. > [https://gist.github.com/jacobtolar/c88d43ab4e8767227891e5cdc188ffad]
[jira] [Commented] (AVRO-2400) Avro 1.9.0 can't resolve schemas that can be resolved in 1.8.2
[ https://issues.apache.org/jira/browse/AVRO-2400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16849231#comment-16849231 ] Raymie Stata commented on AVRO-2400: [~jtolar] made the following observation via email: bq. Note that the resolver and SchemaCompatibility disagree on this. (The test provided would be marked 'incompatible' by SchemaCompatibility...and would have been in 1.8.x as well, even though the actual parser/resolver allowed it). If the spec is indeed updated per @rstata's suggestion then SchemaCompatibility.java should probably be updated to match as well. Yes, good catch, lines 97 and 98 of SchemaCompatibility also need to be updated to use getName vs getFullName. Note, however, that line 102 needs to use the writer's _full_ name, because the aliasing part of the spec is clear as to what to do with qualified vs unqualified names.
[jira] [Commented] (AVRO-2400) Avro 1.9.0 can't resolve schemas that can be resolved in 1.8.2
[ https://issues.apache.org/jira/browse/AVRO-2400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16846683#comment-16846683 ] Raymie Stata commented on AVRO-2400: Sorry for the delay getting back to you. Thanks for reporting this issue! The underlying issue here is an ambiguity in the specification. The spec reads: "both schemas are enums whose names match, [or] both schemas are fixed whose sizes and names match, [or] both schemas are records with the same name." But the specification is not clear as to whether the "name" here is the name-space qualified name, or the unqualified name. The old implementation took the position that "name" here meant the unqualified name, and there doesn't seem to be a good reason to reverse this approach right now. The following would be a good way to fix this bug: 1) Yes, please do submit your reproduction as a test case for the future. 2) Please also update the spec to replace "name[s]" with "(unqualified) name[s]" in the places just quoted. 3) On line 695 of Resolver.java (i.e., in the unionEquiv method), please replace the three occurrences of ".getFullName" with ".getName" -- that should fix the problem in the most surgical way possible. (I'm a bit nervous about re-ordering the WRITER_UNION and READER_UNION cases as you do in your first suggested fix. And completely wiping out the guard on line 694 would completely eliminate any name-based checking, which would relax the spec even further, which I don't think we want to do).
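The getName-vs-getFullName distinction at the heart of this bug can be illustrated with plain strings. The following is a hypothetical sketch, not the actual Resolver.unionEquiv code; the class and method names are invented for the example.

```java
// Hypothetical sketch (not the actual Resolver code) contrasting the two
// name-matching policies for records with the same simple name but
// different namespaces, e.g. "ns1.Rec" vs "ns2.Rec".
class NameMatchSketch {
    // Strip the namespace from a full name, keeping only the simple name.
    static String unqualified(String fullName) {
        int dot = fullName.lastIndexOf('.');
        return dot < 0 ? fullName : fullName.substring(dot + 1);
    }

    // 1.9.0 behavior (getFullName): namespaces must match, so resolution fails.
    static boolean matchByFullName(String writer, String reader) {
        return writer.equals(reader);
    }

    // 1.8.x behavior (getName): only the unqualified name is compared.
    static boolean matchByName(String writer, String reader) {
        return unqualified(writer).equals(unqualified(reader));
    }
}
```

Under the 1.8.x policy, `ns1.Rec` and `ns2.Rec` match; under the 1.9.0 policy they do not, which is the regression this issue reports.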
[jira] [Created] (AVRO-2275) Refactor schema-resolution code from grammar-generation
Raymie Stata created AVRO-2275: -- Summary: Refactor schema-resolution code from grammar-generation Key: AVRO-2275 URL: https://issues.apache.org/jira/browse/AVRO-2275 Project: Apache Avro Issue Type: Improvement Components: java Reporter: Raymie Stata Assignee: Raymie Stata In my own work to extend AVRO-2090, and also in AVRO-2247, an alternative approach to optimizing decoders, we were forced to re-implement Schema resolution logic because it's currently embedded deeply in ResolvingGrammarGenerator. However, in the past the Avro community found it hard to maintain multiple implementations of the schema resolution code, as it is tedious and error-prone code. In this JIRA we've refactored the resolution code into a new class called Resolver, and have rewritten ResolvingGrammarGenerator to be a client of this class. This rewrite passes the full regression suite, including bug-for-bug compatibility with a few questionable resolution rules, such as the "soft matching" rule for records in unions.
[jira] [Created] (AVRO-2274) Improve resolving performance when schemas don't change
Raymie Stata created AVRO-2274: -- Summary: Improve resolving performance when schemas don't change Key: AVRO-2274 URL: https://issues.apache.org/jira/browse/AVRO-2274 Project: Apache Avro Issue Type: Improvement Components: java Reporter: Raymie Stata Assignee: Raymie Stata Decoding optimizations based on the observation that schemas don't change very much. We add special-case paths to optimize the case where a _sub_schema of the reader and the writer are the same. The specific cases are: * In the case of an enumeration, if the reader and writer are the same, then we can simply return the tag written by the writer rather than "adjust" it as if it might have been re-ordered. In fact, we can do this (directly return the tag written by the writer) as long as the reader-schema is an "extension" of the writer's in that it may have added new symbols but hasn't renumbered any of the writer's symbols. Enumerations that either don't change at all or are "extended" as defined here are the common ways to extend enumerations. (Our tests show this optimization improves performance by about 3%.) * When the reader and writer subschemas are both unions, resolution is expensive: we have an outer union preceded by a "writer-union action", but each branch of this outer union consists of union-adjust actions, which are heavyweight. We optimize this case when the reader and writer unions are the same: we fall back on the standard grammar used for a union, avoiding all these adjustments. Since unions are commonly used to encode "nullable" fields in Avro, and nullability rarely changes as a schema evolves, this optimization should help many users. (Our tests show this optimization improves performance by 25-30%, a significant win.) * The "custom code" generated for reading records has to read fields in a loop that uses a switch statement to deal with writers that may have re-ordered fields. In most cases, however, fields have not been reordered (esp. 
in more complex records with many record sub-schemas). So we've added a new method to ResolvingDecoder called readFieldOrderIfDiff, which is a variant of the existing readFieldOrder. If the field order has indeed changed, then readFieldOrderIfDiff returns the new field order, just like readFieldOrder does. However, if the field-order hasn't changed, then readFieldOrderIfDiff returns null. We then modified the generation of custom-decoders for records to add a special-case path that simply reads the record's fields in order, without incurring the overhead of the loop or the switch statement. (Our tests show this optimization improves performance by 8-9%, on top of the 35-40% produced by the original custom-coder optimization.)
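The enum "extension" fast path described in the first bullet can be illustrated outside Avro. A hypothetical sketch with invented names, not the actual ResolvingDecoder code:

```java
import java.util.List;

// Illustrative sketch of the enum fast path: if the reader's symbol list
// starts with the writer's symbols (an "extension"), the writer's tag can
// be returned as-is; otherwise the tag must be remapped by symbol name.
class EnumResolveSketch {
    // True when the reader only appended symbols, without renumbering.
    static boolean isExtension(List<String> writer, List<String> reader) {
        if (reader.size() < writer.size()) return false;
        return reader.subList(0, writer.size()).equals(writer);
    }

    static int resolveTag(int writerTag, List<String> writer, List<String> reader) {
        if (isExtension(writer, reader)) return writerTag;   // fast path: no adjustment
        return reader.indexOf(writer.get(writerTag));        // slow path: remap by name
    }
}
```

The extension check is computed once per schema pair in the real optimization, so the common case (unchanged or appended-to enums) pays nothing per datum.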
[jira] [Updated] (AVRO-2269) Improve usability of Perf.java
[ https://issues.apache.org/jira/browse/AVRO-2269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymie Stata updated AVRO-2269: --- Description: The class {{org.apache.avro.ipc.io.Perf}} is Avro's performance test suite. This JIRA aims to make it easier to use. Specifically: * Added a file {{performance-testing.html}} with guidance on how to use the suite * Added script {{run-script.sh}} that uses {{Perf}} to run structured experiments. * Added tests for performance of resolution of unchanged unions and enumerations, which will be subject to future optimizations. * Tweaks to {{Perf}} for better experimentation (e.g., support for minimum as well as average aggregation). was: In attempting to use Perf.java to show that proposed performance changes actually improved performance, different runs of Perf.java using the exact same code base resulted in variances of 5% or greater – and often 10% or greater – for about half the test cases. With variance this high within a code base, it's impossible to tell if a proposed "improved" code base indeed improves performance. I will post to the wiki and elsewhere some documents and scripts I developed to reduce this variance. This JIRA is for changes to Perf.java that reduce the variance. Specifically: * Access the {{reader}} and {{writer}} instance variables directly in the inner-loop for {{SpecificTest}}, as well as switched to a "reuse" object for reading records, rather than constructing fresh objects for each read. Both helped to significantly reduce variance for {{FooBarSpecificRecordTestWrite}}, a major target of recent performance-improvement efforts. * Switched to {{DirectBinaryEncoder}} instead of {{BufferedBinaryEncoder}} for write tests. Although this slowed writer-tests a bit, it reduced variance a lot, especially for performance tests of primitives like booleans, making it a better choice for measuring the performance-impact of code changes. 
* Started the timer of a test after the encoder/decoder for the test is constructed, rather than before. Helps a little. * Added the ability to output the _minimum_ runtime of a test case across multiple cycles (vs the total runtime across all cycles). This was inspired by JVMSpec, which used to use a minimum. I was able to reduce the variance of total runtime enough to obviate the need for this metric, but since it's helpful diagnostically, I left it in.
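The minimum-vs-average aggregation mentioned in the last bullet can be sketched as follows. This is illustrative only, not the actual Perf code; the class name is invented.

```java
import java.util.Arrays;

// Illustrative sketch of the two aggregations: the average of all cycle
// times, versus the minimum, which is less sensitive to GC pauses and
// other transient noise across cycles.
class PerfAggSketch {
    static double average(long[] cycleNanos) {
        return Arrays.stream(cycleNanos).average().orElse(Double.NaN);
    }

    static long minimum(long[] cycleNanos) {
        return Arrays.stream(cycleNanos).min().orElse(Long.MAX_VALUE);
    }
}
```

A cycle hit by a GC pause inflates the average but leaves the minimum untouched, which is why a minimum is diagnostically useful even when total-runtime variance has been tamed.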
[jira] [Commented] (AVRO-2269) Improve usability of Perf.java
[ https://issues.apache.org/jira/browse/AVRO-2269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16698096#comment-16698096 ] Raymie Stata commented on AVRO-2269: This work has changed direction. The focus shifted away from variance and towards usability. I've updated the subject and description accordingly.
[jira] [Updated] (AVRO-2269) Improve usability of Perf.java
[ https://issues.apache.org/jira/browse/AVRO-2269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymie Stata updated AVRO-2269: --- Summary: Improve usability of Perf.java (was: Improve variances seen across Perf.java runs)
[jira] [Reopened] (AVRO-1658) Add avroDoc on reflect
[ https://issues.apache.org/jira/browse/AVRO-1658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymie Stata reopened AVRO-1658: Assignee: Raymie Stata (was: Zhaonan Sun) The file {{AvroDoc.java}} doesn't have a license, causing the build to break (grumble). Will send a patch for this shortly. > Add avroDoc on reflect > -- > > Key: AVRO-1658 > URL: https://issues.apache.org/jira/browse/AVRO-1658 > Project: Apache Avro > Issue Type: New Feature > Components: java >Affects Versions: 1.7.7 >Reporter: Zhaonan Sun >Assignee: Raymie Stata >Priority: Major > Labels: reflection > Fix For: 1.9.0 > > Attachments: > 0001-AVRO-1658-Java-Add-reflection-annotation-AvroDoc.patch, > 0001-AVRO-1658-Java-Add-reflection-annotation-AvroDoc.patch, > 0001-AVRO-1658-Java-Add-reflection-annotation-AvroDoc.patch > > > Looks like @AvroMeta can't add reserved fields, like @AvroMeta("doc", "some > doc") will throw exceptions. > It would be great if we had a @AvroDoc("some documentation") in > org.apache.avro.reflect
[jira] [Commented] (AVRO-2269) Improve variances seen across Perf.java runs
[ https://issues.apache.org/jira/browse/AVRO-2269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16689164#comment-16689164 ] Raymie Stata commented on AVRO-2269: I took a look at JMH. I think it'd be great to convert `Perf.java` over to JMH. I didn't pursue it because I couldn't find good enough docs on JMH to feel comfortable using it myself. The forthcoming patch I have for AVRO-2269 makes changes that are orthogonal to what JMH does. JMH does things like warm up the JIT and various caches, and so forth, and it runs tests a dynamic number of times in order to "seek" stable statistics on performance metrics. The current `Perf.main` does some of this already – I didn't touch any of that code – but JMH seems to do a much more professional job of it. Thus, again, it'd be great to convert `Perf.java` to JMH. That said, while JMH might do a pretty good job of finding the "true" running time of a high-variance piece of code, it doesn't turn a high-variance piece of code into a low-variance one. The forthcoming patch for AVRO-2269 does the latter – it tries to reduce the inherent variance of the tests (for example, by reducing the allocations done for `FooBarSpecificRecord` tests). JMH together with this forthcoming patch would be a great combination. I just submitted a pull request for AVRO-2268 containing a little bug fix that I want to depend upon, but which is pretty independent of the changes I have for AVRO-2269. If someone could pull AVRO-2268, I'd like to rebase onto that change before submitting the AVRO-2269 patch. 
[jira] [Commented] (AVRO-2268) Perf.java SpecificRecord input data not working
[ https://issues.apache.org/jira/browse/AVRO-2268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16689155#comment-16689155 ] Raymie Stata commented on AVRO-2268: My patch for AVRO-2269 assumes this fix is in place. I wanted to submit this patch separately because the issue is independent of AVRO-2269 and the problem should be fixed whether or not AVRO-2269 is accepted. > Perf.java SpecificRecord input data not working > --- > > Key: AVRO-2268 > URL: https://issues.apache.org/jira/browse/AVRO-2268 > Project: Apache Avro > Issue Type: Test > Components: java >Reporter: Raymie Stata >Assignee: Raymie Stata >Priority: Major > > In {{FooBarSpecificRecordTest.genSingleRecord}}, the {{nicknames}} field is > given an instance of what is returned by {{Arrays.asList}}, which does > _not_ support the {{clear}} method. When reusing objects during a read, the > {{clear}} method is used to clear the contents of array-valued fields during > reading, which causes an {{UnsupportedOperationException}}. So > {{genSingleRecord}} needs to change to set {{nicknames}} to a type that > implements {{clear}}.
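The underlying JDK behavior is easy to demonstrate: Arrays.asList returns a fixed-size view whose clear() throws UnsupportedOperationException, while copying into a real ArrayList yields a list that supports clear(). The helper below is illustrative, not the actual Perf.java fix.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Demonstrates the bug described above: a fixed-size Arrays.asList view
// cannot be cleared, so reuse-based reads that clear array-valued fields
// fail on it. Wrapping in a real ArrayList fixes that.
class NicknamesSketch {
    static boolean supportsClear(List<String> list) {
        try {
            list.clear();
            return true;
        } catch (UnsupportedOperationException e) {
            return false;
        }
    }
}
```

The corresponding fix pattern is `new ArrayList<>(Arrays.asList(...))`, which produces a fully mutable list.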
[jira] [Created] (AVRO-2269) Improve variances seen across Perf.java runs
Raymie Stata created AVRO-2269: -- Summary: Improve variances seen across Perf.java runs Key: AVRO-2269 URL: https://issues.apache.org/jira/browse/AVRO-2269 Project: Apache Avro Issue Type: Test Components: java Reporter: Raymie Stata Assignee: Raymie Stata In attempting to use Perf.java to show that proposed performance changes actually improved performance, different runs of Perf.java using the exact same code base resulted in variances of 5% or greater – and often 10% or greater – for about half the test cases. With variance this high within a code base, it's impossible to tell if a proposed "improved" code base indeed improves performance. I will post to the wiki and elsewhere some documents and scripts I developed to reduce this variance. This JIRA is for changes to Perf.java that reduce the variance. Specifically: * Access the {{reader}} and {{writer}} instance variables directly in the inner-loop for {{SpecificTest}}, as well as switched to a "reuse" object for reading records, rather than constructing fresh objects for each read. Both helped to significantly reduce variance for {{FooBarSpecificRecordTestWrite}}, a major target of recent performance-improvement efforts. * Switched to {{DirectBinaryEncoder}} instead of {{BufferedBinaryEncoder}} for write tests. Although this slowed writer-tests a bit, it reduced variance a lot, especially for performance tests of primitives like booleans, making it a better choice for measuring the performance-impact of code changes. * Started the timer of a test after the encoder/decoder for the test is constructed, rather than before. Helps a little. * Added the ability to output the _minimum_ runtime of a test case across multiple cycles (vs the total runtime across all cycles). This was inspired by JVMSpec, which used to use a minimum. I was able to reduce the variance of total runtime enough to obviate the need for this metric, but since it's helpful diagnostically, I left it in. 
[jira] [Commented] (AVRO-2252) I'd like to improve Avro .NET (C#) library (many points)
[ https://issues.apache.org/jira/browse/AVRO-2252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16677387#comment-16677387 ] Raymie Stata commented on AVRO-2252: As annoying as bad style can be, large-scale changes to a code base for the sake of stylistic improvement make it difficult to navigate the history of a code base, because diffs of post-change code against pre-change code show a lot of spurious changes not relevant to the problem you're trying to track down. My suggestion is to focus on substantive changes for now (bug fixes, performance improvements, new features, etc). Make your stylistic changes right after the new release is shipped. This way, if any of the new substantive changes cause regressions, it will be easier to debug them (and fix them both on the release branch and on master). > I'd like to improve Avro .NET (C#) library (many points) > > > Key: AVRO-2252 > URL: https://issues.apache.org/jira/browse/AVRO-2252 > Project: Avro > Issue Type: Wish > Components: csharp >Reporter: Anton Ryzhov >Priority: Major > > Hello all, > The company where I'm working as a .NET developer is actively using Avro > format. > I'd like to improve Avro .NET (C#) library: > 1) clean-up the code: > - remove trailing spaces, unused namespace usings, etc. > - remove unused dependency of log4net library > - replace dependency of json library from direct reference to Nuget package > 2) format the code to unify code style everywhere in the library > - possibly using the Microsoft recommended code style for C# > [https://docs.microsoft.com/en-us/dotnet/csharp/programming-guide/inside-a-program/coding-conventions] > 3) use the latest C# 7.0 language features to make the code more compact and > readable > 4) make .NET 4.5 and .NET standard 2.0 versions of the library, keeping the > existing compatibility with the .NET 3.5 > - add asynchronous API to the .NET 4.5 and .NET standard 2.0 versions (async > methods along with the synchronous ones). 
> What do you think?
[jira] [Reopened] (AVRO-2251) Modify Perf.java to better support automation scripts
[ https://issues.apache.org/jira/browse/AVRO-2251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymie Stata reopened AVRO-2251: As I continue to work on performance testing, I want to experiment with values for Perf.COUNT and Perf.CYCLES without having to recompile. An additional patch is forthcoming that allows for this.
[jira] [Created] (AVRO-2251) Modify Perf.java to better support automation scripts
Raymie Stata created AVRO-2251: -- Summary: Modify Perf.java to better support automation scripts Key: AVRO-2251 URL: https://issues.apache.org/jira/browse/AVRO-2251 Project: Avro Issue Type: Test Reporter: Raymie Stata Assignee: Raymie Stata To better support automated performance-test suites, this patch proposes two new arguments to the 'Perf.java' command-line tool: The `-o` argument gives 'Perf.java' the name of a file that should get the results of the run. Currently, Perf.java sends output to System.out – but if 'Perf.java' is invoked using Maven, which is the easiest way to invoke it, then System.out is polluted with a bunch of other output. Redirecting 'Perf.java' output metrics to a file makes it easier for automation scripts to process those metrics. The `-c [spec]` argument tells 'Perf.java' to generate a comma-separated output. By default, all benchmark metrics are output, but the optional `spec` argument can be used to indicate exactly which metrics should be included in the CSV output. The default output of 'Perf.java' is optimized for human inspection – for example, it includes the text "ms" in the running-time column so humans will understand the units of the running-time metric. The `-c` option will tell 'Perf.java' to generate machine-optimized output that is easier to consume by automation scripts. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
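The two proposed flags could be handled along these lines. This is a hypothetical sketch, not the actual Perf.java patch: the class and method names are illustrative, and only the output-selection logic is shown.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of -o / -c handling: -o redirects results to a file so
// Maven's own console noise doesn't pollute them; -c switches from a
// human-readable format (with units) to machine-parseable CSV.
public class PerfOutputSketch {
    /** Human-readable form: includes "ms" so a person knows the units. */
    static String humanLine(String test, double millis) {
        return String.format("%s: %.0f ms", test, millis);
    }

    /** Machine-readable CSV form: no units, trivial for scripts to parse. */
    static String csvLine(String test, double millis) {
        return test + "," + millis;
    }

    public static void main(String[] args) throws IOException {
        String outFile = null;
        boolean csv = false;
        for (int i = 0; i < args.length; i++) {
            if ("-o".equals(args[i])) outFile = args[++i]; // results file
            else if ("-c".equals(args[i])) csv = true;     // CSV output
        }

        // Placeholder metrics standing in for real benchmark results.
        Map<String, Double> results = new LinkedHashMap<>();
        results.put("FooBarSpecificRecordTestWrite", 2296.0);
        results.put("FooBarSpecificRecordTestRead", 4130.0);

        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, Double> e : results.entrySet())
            sb.append(csv ? csvLine(e.getKey(), e.getValue())
                          : humanLine(e.getKey(), e.getValue())).append('\n');

        if (outFile == null) System.out.print(sb);
        else Files.write(Paths.get(outFile), sb.toString().getBytes());
    }
}
```

Run as `java PerfOutputSketch -c -o results.csv` to get CSV in a file, or with no flags to get the human-oriented format on System.out.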
[jira] [Assigned] (AVRO-1022) Error in validate name
[ https://issues.apache.org/jira/browse/AVRO-1022?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymie Stata reassigned AVRO-1022: -- Assignee: Raymie Stata > Error in validate name > -- > > Key: AVRO-1022 > URL: https://issues.apache.org/jira/browse/AVRO-1022 > Project: Avro > Issue Type: Bug > Components: java >Reporter: Raymie Stata >Assignee: Raymie Stata >Priority: Minor > Attachments: AVRO-1022.patch, AVRO-1022.patch, > unicode-recommendation.html > > > Fix schema.validateName to allow only ASCII letters, not Unicode letters. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (AVRO-419) Consistent laziness when resolving partially-compatible changes
[ https://issues.apache.org/jira/browse/AVRO-419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymie Stata reassigned AVRO-419: - Assignee: Raymie Stata > Consistent laziness when resolving partially-compatible changes > --- > > Key: AVRO-419 > URL: https://issues.apache.org/jira/browse/AVRO-419 > Project: Avro > Issue Type: Bug > Components: spec >Reporter: Raymie Stata >Assignee: Raymie Stata >Priority: Major > > Avro schema resolution is generally "lazy" when it comes to dealing with > incompatible changes. If the writer writes a union of "int" and "null", and > the reader expects just an "int", Avro doesn't raise an exception unless the > writer _actually_ writes a "null" (and the reader attempts to read it). > This laziness is a powerful feature for supporting "forward compatibility" > (old readers reading data written by new writers). In the example just > given, for example, we might decide at some point that a column needs to be > "nullable" but there's a lot of old code that assumes that it's not. When > using old code, we can often arrange to avoid sending the old code any new > records that have null-values in that column. It's powerful to allow new > writers to write against the nullable schema and allow readers to read those > records. (For this to be safe, it's also important that this be _checked,_ > i.e., that a run-time error is thrown if a bad value is passed to the reader.) > Avro is lazy in many places (e.g., in the union example just given, and for > enumerations). But it's not _consistently_ lazy. I propose we comb through > the spec and make it lazy in all places we can, unless there's a compelling > reason not to. > Numeric types is one area where Avro is not consistently lazy. I propose > that we fairly liberally allow any change from one numeric type to another, > and raise errors at runtime if bad values are found. 
An "int" can be changed > to a "long", for example, and an error is raised when a reader gets an > out-of-bounds value. A "double" can be changed to an "int", and an error is > raised if the reader gets a non-integer value or an out-of-bounds value. > (I'm not sure if there are types beyond numerics where we could be more > consistently lazy, but I decided to write this issue generically just in > case.) > (One might object that these checks are expensive, but note that they are > only needed when the reader and writer specs don't agree. Thus, if these > checks are induced, then the system designer _wanted_ these checks; we're > only adding value here, not imposing costs.) > I'm not sure if there are other a -- This message was sent by Atlassian JIRA (v7.6.3#76005)
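The lazy numeric promotion proposed above could be sketched as follows. This is an illustration of the idea, not Avro's actual resolver API: the schema change is accepted up front, and an error is raised only when a concrete value fails to fit the reader's type.

```java
// Illustrative sketch of "lazy" numeric resolution: accept the writer/reader
// type mismatch at resolution time, and fail only when an actual
// out-of-range (or non-integral) value is read.
public class LazyNarrowingSketch {
    /** Writer wrote "long", reader expects "int": check each value lazily. */
    static int readLongAsInt(long writtenValue) {
        if (writtenValue < Integer.MIN_VALUE || writtenValue > Integer.MAX_VALUE)
            throw new ArithmeticException("value out of int range: " + writtenValue);
        return (int) writtenValue;
    }

    /** Writer wrote "double", reader expects "int": reject non-integer values. */
    static int readDoubleAsInt(double writtenValue) {
        if (writtenValue != Math.floor(writtenValue) || Double.isInfinite(writtenValue))
            throw new ArithmeticException("non-integer value: " + writtenValue);
        return readLongAsInt((long) writtenValue);
    }

    public static void main(String[] args) {
        System.out.println(readLongAsInt(42L));   // fine: 42 fits in an int
        System.out.println(readDoubleAsInt(7.0)); // fine: 7.0 is integral
        try {
            readLongAsInt(1L << 40);              // fails only when actually read
        } catch (ArithmeticException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

The design point matches the union example in the issue: the checks run only on the per-value read path, and only when the reader and writer schemas actually disagree.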
[jira] [Resolved] (AVRO-419) Consistent laziness when resolving partially-compatible changes
[ https://issues.apache.org/jira/browse/AVRO-419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymie Stata resolved AVRO-419. --- Resolution: Won't Fix This is ancient history: will not fix. > Consistent laziness when resolving partially-compatible changes > --- > > Key: AVRO-419 > URL: https://issues.apache.org/jira/browse/AVRO-419 > Project: Avro > Issue Type: Bug > Components: spec >Reporter: Raymie Stata >Priority: Major > > Avro schema resolution is generally "lazy" when it comes to dealing with > incompatible changes. If the writer writes a union of "int" and "null", and > the reader expects just an "int", Avro doesn't raise an exception unless the > writer _actually_ writes a "null" (and the reader attempts to read it). > This laziness is a powerful feature for supporting "forward compatibility" > (old readers reading data written by new writers). In the example just > given, for example, we might decide at some point that a column needs to be > "nullable" but there's a lot of old code that assumes that it's not. When > using old code, we can often arrange to avoid sending the old code any new > records that have null-values in that column. It's powerful to allow new > writers to write against the nullable schema and allow readers to read those > records. (For this to be safe, it's also important that this be _checked,_ > i.e., that a run-time error is thrown if a bad value is passed to the reader.) > Avro is lazy in many places (e.g., in the union example just given, and for > enumerations). But it's not _consistently_ lazy. I propose we comb through > the spec and make it lazy in all places we can, unless there's a compelling > reason not to. > Numeric types is one area where Avro is not consistently lazy. I propose > that we fairly liberally allow any change from one numeric type to another, > and raise errors at runtime if bad values are found. 
An "int" can be changed > to a "long", for example, and an error is raised when a reader gets an > out-of-bounds value. A "double" can be changed to an "int", and an error is > raised if the reader gets a non-integer value or an out-of-bounds value. > (I'm not sure if there are types beyond numerics where we could be more > consistently lazy, but I decided to write this issue generically just in > case.) > (One might object that these checks are expensive, but note that they are > only needed when the reader and writer specs don't agree. Thus, if these > checks are induced, then the system designer _wanted_ these checks; we're > only adding value here, not imposing costs.) > I'm not sure if there are other a -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Comment Edited] (AVRO-2244) Problems with TestSpecificLogicalTypes.testAbilityToReadJsr310RecordWrittenAsJodaRecord:148
[ https://issues.apache.org/jira/browse/AVRO-2244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16661817#comment-16661817 ] Raymie Stata edited comment on AVRO-2244 at 10/24/18 6:51 AM: -- {{If there's any doubt about this issue being resolved, I just got the following error:}} {{ testAbilityToReadJsr310RecordWrittenAsJodaRecord(org.apache.avro.specific.TestSpecificLogicalTypes) Time elapsed: 0.085 sec <<< FAILURE!}} {{ java.lang.AssertionError:}}{{Expected: is "23:43:30.800"}} {{ but: was "23:43:30.8"}} {{ at org.apache.avro.specific.TestSpecificLogicalTypes.testAbilityToReadJsr310RecordWrittenAsJodaRecord(TestSpecificLogicalTypes.java:150)}} {{ Personally, I would revert AVRO-2241 and figure out how to get `TestSpecificLogicalTypes.testAbilityToReadJsr310RecordWrittenAsJodaRecord` to output zero-padded, three-digit time stamps for the Jsr310 case.}} was (Author: raymie): If there's any doubt about this issue being resolved, I just got the following error: ``` testAbilityToReadJsr310RecordWrittenAsJodaRecord(org.apache.avro.specific.TestSpecificLogicalTypes) Time elapsed: 0.085 sec <<< FAILURE! java.lang.AssertionError: Expected: is "23:43:30.800" but: was "23:43:30.8" at org.apache.avro.specific.TestSpecificLogicalTypes.testAbilityToReadJsr310RecordWrittenAsJodaRecord(TestSpecificLogicalTypes.java:150) ``` Personally, I would revert AVRO-2241 and figure out how to get `TestSpecificLogicalTypes.testAbilityToReadJsr310RecordWrittenAsJodaRecord` to output zero-padded, three-digit time stamps for the Jsr310 case. 
> Problems with > TestSpecificLogicalTypes.testAbilityToReadJsr310RecordWrittenAsJodaRecord:148 > --- > > Key: AVRO-2244 > URL: https://issues.apache.org/jira/browse/AVRO-2244 > Project: Avro > Issue Type: Bug > Components: logical types >Reporter: Raymie Stata >Priority: Major > > I've seen an intermittent test failure that looks like this: > {{Failed tests:}} > {{ > TestSpecificLogicalTypes.testAbilityToReadJsr310RecordWrittenAsJodaRecord:148}} > {{Expected: is "20:35:18.720"}} > {{ but: was "20:35:18.72"}} > When I see this failure, it's always the case that the trailing digit is > zero. I suspect that it's a bug where the trailing zero is not printed. > Since the test cases use the current time, then most of the time the trailing > digit isn't zero and the bug isn't tickled. But once-in-a-while the current > time has a trailing zero, which tickles the bug. > If this diagnosis is correct, then in addition to fixing the bug, it might be > a good idea to add tests with hard-wired, static times that cover corner > cases like this one. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (AVRO-2244) Problems with TestSpecificLogicalTypes.testAbilityToReadJsr310RecordWrittenAsJodaRecord:148
[ https://issues.apache.org/jira/browse/AVRO-2244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16661817#comment-16661817 ] Raymie Stata commented on AVRO-2244: If there's any doubt about this issue being resolved, I just got the following error: ``` testAbilityToReadJsr310RecordWrittenAsJodaRecord(org.apache.avro.specific.TestSpecificLogicalTypes) Time elapsed: 0.085 sec <<< FAILURE! java.lang.AssertionError: Expected: is "23:43:30.800" but: was "23:43:30.8" at org.apache.avro.specific.TestSpecificLogicalTypes.testAbilityToReadJsr310RecordWrittenAsJodaRecord(TestSpecificLogicalTypes.java:150) ``` Personally, I would revert AVRO-2241 and figure out how to get `TestSpecificLogicalTypes.testAbilityToReadJsr310RecordWrittenAsJodaRecord` to output zero-padded, three-digit time stamps for the Jsr310 case. > Problems with > TestSpecificLogicalTypes.testAbilityToReadJsr310RecordWrittenAsJodaRecord:148 > --- > > Key: AVRO-2244 > URL: https://issues.apache.org/jira/browse/AVRO-2244 > Project: Avro > Issue Type: Bug > Components: logical types >Reporter: Raymie Stata >Priority: Major > > I've seen an intermittent test failure that looks like this: > {{Failed tests:}} > {{ > TestSpecificLogicalTypes.testAbilityToReadJsr310RecordWrittenAsJodaRecord:148}} > {{Expected: is "20:35:18.720"}} > {{ but: was "20:35:18.72"}} > When I see this failure, it's always the case that the trailing digit is > zero. I suspect that it's a bug where the trailing zero is not printed. > Since the test cases use the current time, then most of the time the trailing > digit isn't zero and the bug isn't tickled. But once-in-a-while the current > time has a trailing zero, which tickles the bug. > If this diagnosis is correct, then in addition to fixing the bug, it might be > a good idea to add tests with hard-wired, static times that cover corner > cases like this one. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (AVRO-2244) Problems with TestSpecificLogicalTypes.testAbilityToReadJsr310RecordWrittenAsJodaRecord:148
[ https://issues.apache.org/jira/browse/AVRO-2244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16661123#comment-16661123 ] Raymie Stata commented on AVRO-2244: I don't believe the fix for AVRO-2241 addresses the problem in AVRO-2244: 2244 seems to be related to the _formatting_ of times, rather than the truncation of them. However, I think the reverse is true: A fix to AVRO-2244 would (have) addressed the problem seen in AVRO-2241. > Problems with > TestSpecificLogicalTypes.testAbilityToReadJsr310RecordWrittenAsJodaRecord:148 > --- > > Key: AVRO-2244 > URL: https://issues.apache.org/jira/browse/AVRO-2244 > Project: Avro > Issue Type: Bug > Components: logical types >Reporter: Raymie Stata >Priority: Major > > I've seen an intermittent test failure that looks like this: > {{Failed tests:}} > {{ > TestSpecificLogicalTypes.testAbilityToReadJsr310RecordWrittenAsJodaRecord:148}} > {{Expected: is "20:35:18.720"}} > {{ but: was "20:35:18.72"}} > When I see this failure, it's always the case that the trailing digit is > zero. I suspect that it's a bug where the trailing zero is not printed. > Since the test cases use the current time, then most of the time the trailing > digit isn't zero and the bug isn't tickled. But once-in-a-while the current > time has a trailing zero, which tickles the bug. > If this diagnosis is correct, then in addition to fixing the bug, it might be > a good idea to add tests with hard-wired, static times that cover corner > cases like this one. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (AVRO-2090) Improve encode/decode time for SpecificRecord using code generation
[ https://issues.apache.org/jira/browse/AVRO-2090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16658852#comment-16658852 ] Raymie Stata commented on AVRO-2090: Any more feedback on this patch? > Improve encode/decode time for SpecificRecord using code generation > --- > > Key: AVRO-2090 > URL: https://issues.apache.org/jira/browse/AVRO-2090 > Project: Avro > Issue Type: Improvement > Components: java >Reporter: Raymie Stata >Assignee: Raymie Stata >Priority: Major > Attachments: customcoders.md, perf-data.txt > > > New implementation for generation of code for SpecificRecord that improves > decoding by over 10% and encoding over 30% (more improvements are on the > way). This feature is behind a feature flag > ({{org.apache.avro.specific.use_custom_coders}}) and (for now) turned off by > default. See [Getting Started > (Java)|https://avro.apache.org/docs/current/gettingstartedjava.html#Beta+feature:+Generating+faster+code] > for instructions. > (A bit more info: Compared to GenericRecords, SpecificRecords offer > type-safety plus the performance of traditional getters/setters/instance > variables. But these are only beneficial to Java code accessing those > records. SpecificRecords inherit serialization and deserialization code from > GenericRecords, which is dynamic and thus slow (in fact, benchmarks show that > serialization and deserialization is _slower_ for SpecificRecord than for > GenericRecord). This patch extends record.vm to generate custom, > higher-performance encoder and decoder functions for SpecificRecords.) -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (AVRO-2090) Improve encode/decode time for SpecificRecord using code generation
[ https://issues.apache.org/jira/browse/AVRO-2090?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymie Stata updated AVRO-2090: --- Description: New implementation for generation of code for SpecificRecord that improves decoding by over 10% and encoding over 30% (more improvements are on the way). This feature is behind a feature flag ({{org.apache.avro.specific.use_custom_coders}}) and (for now) turned off by default. See [Getting Started (Java)|https://avro.apache.org/docs/current/gettingstartedjava.html#Beta+feature:+Generating+faster+code] for instructions. (A bit more info: Compared to GenericRecords, SpecificRecords offer type-safety plus the performance of traditional getters/setters/instance variables. But these are only beneficial to Java code accessing those records. SpecificRecords inherit serialization and deserialization code from GenericRecords, which is dynamic and thus slow (in fact, benchmarks show that serialization and deserialization is _slower_ for SpecificRecord than for GenericRecord). This patch extends record.vm to generate custom, higher-performance encoder and decoder functions for SpecificRecords.) was: New implementation for generation of code for SpecificRecord that improves decoding by over 10% and encoding over 30% (more improvements are on the way). This feature is behind a feature flag ({{org.apache.avro.specific.use_custom_coders}}) and (for now) turned off by default. See [Getting Started (Java)|https://avro.apache.org/docs/current/gettingstartedjava.html#Beta+feature:+Generating faster+code] for instructions. (A bit more info: Compared to GenericRecords, SpecificRecords offer type-safety plus the performance of traditional getters/setters/instance variables. But these are only beneficial to Java code accessing those records. 
SpecificRecords inherit serialization and deserialization code from GenericRecords, which is dynamic and thus slow (in fact, benchmarks show that serialization and deserialization is _slower_ for SpecificRecord than for GenericRecord). This patch extends record.vm to generate custom, higher-performance encoder and decoder functions for SpecificRecords.) > Improve encode/decode time for SpecificRecord using code generation > --- > > Key: AVRO-2090 > URL: https://issues.apache.org/jira/browse/AVRO-2090 > Project: Avro > Issue Type: Improvement > Components: java >Reporter: Raymie Stata >Assignee: Raymie Stata >Priority: Major > Attachments: customcoders.md, perf-data.txt > > > New implementation for generation of code for SpecificRecord that improves > decoding by over 10% and encoding over 30% (more improvements are on the > way). This feature is behind a feature flag > ({{org.apache.avro.specific.use_custom_coders}}) and (for now) turned off by > default. See [Getting Started > (Java)|https://avro.apache.org/docs/current/gettingstartedjava.html#Beta+feature:+Generating+faster+code] > for instructions. > (A bit more info: Compared to GenericRecords, SpecificRecords offer > type-safety plus the performance of traditional getters/setters/instance > variables. But these are only beneficial to Java code accessing those > records. SpecificRecords inherit serialization and deserialization code from > GenericRecords, which is dynamic and thus slow (in fact, benchmarks show that > serialization and deserialization is _slower_ for SpecificRecord than for > GenericRecord). This patch extends record.vm to generate custom, > higher-performance encoder and decoder functions for SpecificRecords.) -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (AVRO-2090) Improve encode/decode time for SpecificRecord using code generation
[ https://issues.apache.org/jira/browse/AVRO-2090?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymie Stata updated AVRO-2090: --- Description: New implementation for generation of code for SpecificRecord that improves decoding by over 10% and encoding over 30% (more improvements are on the way). This feature is behind a feature flag ({{org.apache.avro.specific.use_custom_coders}}) and (for now) turned off by default. See [Getting Started (Java)|https://avro.apache.org/docs/current/gettingstartedjava.html#Beta+feature:+Generating faster+code] for instructions. (A bit more info: Compared to GenericRecords, SpecificRecords offer type-safety plus the performance of traditional getters/setters/instance variables. But these are only beneficial to Java code accessing those records. SpecificRecords inherit serialization and deserialization code from GenericRecords, which is dynamic and thus slow (in fact, benchmarks show that serialization and deserialization is _slower_ for SpecificRecord than for GenericRecord). This patch extends record.vm to generate custom, higher-performance encoder and decoder functions for SpecificRecords.) was: New implementation for generation of code for SpecificRecord that improves decoding by over 10% and encoding over 30% (more improvements are on the way). This feature is behind a feature flag ({{org.apache.avro.specific.use_custom_coders}}) and (for now) turned off by default. See [Getting Started (Java)|https://avro.apache.org/docs/current/gettingstartedjava.html#Beta+feature:+Faster+code+generation] for instructions. (A bit more info: Compared to GenericRecords, SpecificRecords offer type-safety plus the performance of traditional getters/setters/instance variables. But these are only beneficial to Java code accessing those records. 
SpecificRecords inherit serialization and deserialization code from GenericRecords, which is dynamic and thus slow (in fact, benchmarks show that serialization and deserialization is _slower_ for SpecificRecord than for GenericRecord). This patch extends record.vm to generate custom, higher-performance encoder and decoder functions for SpecificRecords.) > Improve encode/decode time for SpecificRecord using code generation > --- > > Key: AVRO-2090 > URL: https://issues.apache.org/jira/browse/AVRO-2090 > Project: Avro > Issue Type: Improvement > Components: java >Reporter: Raymie Stata >Assignee: Raymie Stata >Priority: Major > Attachments: customcoders.md, perf-data.txt > > > New implementation for generation of code for SpecificRecord that improves > decoding by over 10% and encoding over 30% (more improvements are on the > way). This feature is behind a feature flag > ({{org.apache.avro.specific.use_custom_coders}}) and (for now) turned off by > default. See [Getting Started > (Java)|https://avro.apache.org/docs/current/gettingstartedjava.html#Beta+feature:+Generating > faster+code] for instructions. > (A bit more info: Compared to GenericRecords, SpecificRecords offer > type-safety plus the performance of traditional getters/setters/instance > variables. But these are only beneficial to Java code accessing those > records. SpecificRecords inherit serialization and deserialization code from > GenericRecords, which is dynamic and thus slow (in fact, benchmarks show that > serialization and deserialization is _slower_ for SpecificRecord than for > GenericRecord). This patch extends record.vm to generate custom, > higher-performance encoder and decoder functions for SpecificRecords.) -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (AVRO-2090) Improve encode/decode time for SpecificRecord using code generation
[ https://issues.apache.org/jira/browse/AVRO-2090?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymie Stata updated AVRO-2090: --- Description: New implementation for generation of code for SpecificRecord that improves decoding by over 10% and encoding over 30% (more improvements are on the way). This feature is behind a feature flag ({{org.apache.avro.specific.use_custom_coders}}) and (for now) turned off by default. See [Getting Started (Java)|https://avro.apache.org/docs/current/gettingstartedjava.html#Beta+feature:+Faster+code+generation] for instructions. (A bit more info: Compared to GenericRecords, SpecificRecords offer type-safety plus the performance of traditional getters/setters/instance variables. But these are only beneficial to Java code accessing those records. SpecificRecords inherit serialization and deserialization code from GenericRecords, which is dynamic and thus slow (in fact, benchmarks show that serialization and deserialization is _slower_ for SpecificRecord than for GenericRecord). This patch extends record.vm to generate custom, higher-performance encoder and decoder functions for SpecificRecords.) was: Compared to GenericRecords, SpecificRecords offer type-safety plus the performance of traditional getters/setters/instance variables. But these are only beneficial to Java code accessing those records. SpecificRecords inherit serialization and deserialization code from GenericRecords, which is dynamic and thus slow (in fact, benchmarks show that serialization and deserialization is _slower_ for SpecificRecord than for GenericRecord). This patch extends record.vm to generate custom, higher-performance encoder and decoder functions for SpecificRecords. We've run a public benchmark showing that the new code reduces serialization time by 2/3 and deserialization time by close to 50%. 
> Improve encode/decode time for SpecificRecord using code generation > --- > > Key: AVRO-2090 > URL: https://issues.apache.org/jira/browse/AVRO-2090 > Project: Avro > Issue Type: Improvement > Components: java >Reporter: Raymie Stata >Assignee: Raymie Stata >Priority: Major > Attachments: customcoders.md, perf-data.txt > > > New implementation for generation of code for SpecificRecord that improves > decoding by over 10% and encoding over 30% (more improvements are on the > way). This feature is behind a feature flag > ({{org.apache.avro.specific.use_custom_coders}}) and (for now) turned off by > default. See [Getting Started > (Java)|https://avro.apache.org/docs/current/gettingstartedjava.html#Beta+feature:+Faster+code+generation] > for instructions. > (A bit more info: Compared to GenericRecords, SpecificRecords offer > type-safety plus the performance of traditional getters/setters/instance > variables. But these are only beneficial to Java code accessing those > records. SpecificRecords inherit serialization and deserialization code from > GenericRecords, which is dynamic and thus slow (in fact, benchmarks show that > serialization and deserialization is _slower_ for SpecificRecord than for > GenericRecord). This patch extends record.vm to generate custom, > higher-performance encoder and decoder functions for SpecificRecords.) -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (AVRO-2244) Problems with TestSpecificLogicalTypes.testAbilityToReadJsr310RecordWrittenAsJodaRecord:148
[ https://issues.apache.org/jira/browse/AVRO-2244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16656687#comment-16656687 ] Raymie Stata commented on AVRO-2244: The spec for [ISO_LOCAL_TIME|https://docs.oracle.com/javase/8/docs/api/java/time/format/DateTimeFormatter.html#ISO_LOCAL_TIME]: "One to nine digits for the nano-of-second. As many digits will be output as required." So it's going to drop the trailing zeros. The spec for the [JODA formatter|https://www.joda.org/joda-time/apidocs/org/joda/time/format/ISODateTimeFormat.html#time--] that's being used here says: "Returns a formatter for a two digit hour of day, two digit minute of hour, two digit second of minute, three digit fraction of second, and time zone offset (HH:mm:ss.SSSZZ).", i.e., it will pad with trailing zeros. So the test code in this case is buggy. > Problems with > TestSpecificLogicalTypes.testAbilityToReadJsr310RecordWrittenAsJodaRecord:148 > --- > > Key: AVRO-2244 > URL: https://issues.apache.org/jira/browse/AVRO-2244 > Project: Avro > Issue Type: Bug > Components: logical types >Reporter: Raymie Stata >Priority: Major > > I've seen an intermittent test failure that looks like this: > {{Failed tests:}} > {{ > TestSpecificLogicalTypes.testAbilityToReadJsr310RecordWrittenAsJodaRecord:148}} > {{Expected: is "20:35:18.720"}} > {{ but: was "20:35:18.72"}} > When I see this failure, it's always the case that the trailing digit is > zero. I suspect that it's a bug where the trailing zero is not printed. > Since the test cases use the current time, then most of the time the trailing > digit isn't zero and the bug isn't tickled. But once-in-a-while the current > time has a trailing zero, which tickles the bug. > If this diagnosis is correct, then in addition to fixing the bug, it might be > a good idea to add tests with hard-wired, static times that cover corner > cases like this one. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
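The formatter mismatch described in the comment above can be reproduced with java.time alone; this is a minimal sketch, not the Avro conversion code itself. ISO_LOCAL_TIME emits only as many fraction digits as needed, while a fixed HH:mm:ss.SSS pattern, like the Joda formatter, zero-pads to three digits.

```java
import java.time.LocalTime;
import java.time.format.DateTimeFormatter;

// Reproduces the mismatch: ISO_LOCAL_TIME trims trailing zeros in the
// fraction-of-second, while a fixed .SSS pattern always pads to 3 digits.
public class FractionFormatDemo {
    public static void main(String[] args) {
        LocalTime t = LocalTime.of(20, 35, 18, 720_000_000); // .720 seconds

        String iso = t.format(DateTimeFormatter.ISO_LOCAL_TIME);
        String padded = t.format(DateTimeFormatter.ofPattern("HH:mm:ss.SSS"));

        System.out.println(iso);    // 20:35:18.72  (trailing zero dropped)
        System.out.println(padded); // 20:35:18.720 (Joda-style, zero-padded)
    }
}
```

Whenever the fraction ends in a zero, the two strings differ, which is exactly the intermittent failure the test exhibits.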
[jira] [Comment Edited] (AVRO-2090) Improve encode/decode time for SpecificRecord using code generation
[ https://issues.apache.org/jira/browse/AVRO-2090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16654739#comment-16654739 ] Raymie Stata edited comment on AVRO-2090 at 10/18/18 7:19 AM: -- I've attached my two runs of Perf.java combined into a single file ([^perf-data.txt]). The first four columns of numbers in this file are the results with custom-encoders turned off; the next four columns are the results with custom-encoders on. For the two SpecificRecord cases: On my machine, FooBarSpecificRecordTestWrite improved 36% (from 3577 ms to 2296 ms), while FooBarSpecificRecordTestRead improved 12% (4728 ms to 4130 ms). It's not surprising that the read case improved less: the overhead of accommodating schema migration is high. I have some ideas on how to improve performance even more, esp. for the read case. That said, a >10% improvement is not bad, and 36% improvement is quite good, so I suggest we commit this change as-is and save further improvements to future patches. (Thiru points out that FooBarSpecificRecord is a very small class, which probably understates the performance improvements of this patch. In our work at Aqfer, we've seen larger improvements.) was (Author: raymie): I've attached my two runs of Perf.java combined into a single file ([^perf-data.txt]). The first four columns of numbers in this file are the results with custom-encoders turned off; the next four columns are the results with custom-encoders on. For the two SpecificRecord cases: On my machine, FooBarSpecificRecordTestWrite improved 36% (from 3577 ms to 2296 ms), while FooBarSpecificRecordTestRead improved 12% (4728 ms to 4130 ms). It's not surprising that the read case improved less: the overhead of accommodating schema migration is high. I have some ideas on how to improve performance even more, esp. for the read case. 
That said, a >10% improvement is not bad, and 36% improvement is quite good, so I suggest we commit this change as-is and save further improvements to future patches. > Improve encode/decode time for SpecificRecord using code generation > --- > > Key: AVRO-2090 > URL: https://issues.apache.org/jira/browse/AVRO-2090 > Project: Avro > Issue Type: Improvement > Components: java >Reporter: Raymie Stata >Assignee: Raymie Stata >Priority: Major > Attachments: customcoders.md, perf-data.txt > > > Compared to GenericRecords, SpecificRecords offer type-safety plus the > performance of traditional getters/setters/instance variables. But these are > only beneficial to Java code accessing those records. SpecificRecords > inherit serialization and deserialization code from GenericRecords, which is > dynamic and thus slow (in fact, benchmarks show that serialization and > deserialization is _slower_ for SpecificRecord than for GenericRecord). > This patch extends record.vm to generate custom, higher-performance encoder > and decoder functions for SpecificRecords. We've run a public benchmark > showing that the new code reduces serialization time by 2/3 and > deserialization time by close to 50%. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (AVRO-2235) Regenerate TestRecordWithLogicalTypes
[ https://issues.apache.org/jira/browse/AVRO-2235?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymie Stata resolved AVRO-2235. Resolution: Won't Fix Release Note: Based on Thiru's input, the old generated code should stay in place because it's there to test backward compatibility. > Regenerate TestRecordWithLogicalTypes > - > > Key: AVRO-2235 > URL: https://issues.apache.org/jira/browse/AVRO-2235 > Project: Avro > Issue Type: Bug > Components: logical types >Reporter: Raymie Stata >Assignee: Raymie Stata >Priority: Major > > TestRecordWithLogicalTypes.java is code that was generated by the specific > compiler and then moved into the testing code tree. It hasn't been changed > in a while, although the compiler is evolving. I tried to regenerate it and > found there is a problem with record_with_logical_types.avsc. I will fix the > schema file and then regenerate TestRecordWithLogicalTypes and check both in. > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (AVRO-2244) Problems with TestSpecificLogicalTypes.testAbilityToReadJsr310RecordWrittenAsJodaRecord:148
Raymie Stata created AVRO-2244: -- Summary: Problems with TestSpecificLogicalTypes.testAbilityToReadJsr310RecordWrittenAsJodaRecord:148 Key: AVRO-2244 URL: https://issues.apache.org/jira/browse/AVRO-2244 Project: Avro Issue Type: Bug Components: logical types Reporter: Raymie Stata I've seen an intermittent test failure that looks like this: {{Failed tests:}} {{ TestSpecificLogicalTypes.testAbilityToReadJsr310RecordWrittenAsJodaRecord:148}} {{Expected: is "20:35:18.720"}} {{ but: was "20:35:18.72"}} When I see this failure, it's always the case that the trailing digit is zero. I suspect it's a bug where a trailing zero is not printed. Since the test cases use the current time, most of the time the trailing digit isn't zero and the bug isn't tickled. But once in a while the current time has a trailing zero, which tickles the bug. If this diagnosis is correct, then in addition to fixing the bug, it might be a good idea to add tests with hard-wired, static times that cover corner cases like this one. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
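If the diagnosis is right, the hard-wired corner-case test proposed above is easy to sketch. The following uses java.time with a fixed three-digit millisecond pattern rather than the Joda/JSR-310 formatters in the actual test, and the class name is invented:

```java
import java.time.LocalTime;
import java.time.format.DateTimeFormatter;

public class TrailingZeroTest {
    // A fixed-width "SSS" millisecond field always emits three digits,
    // so ".720" can never collapse to ".72".
    static final DateTimeFormatter MILLIS =
        DateTimeFormatter.ofPattern("HH:mm:ss.SSS");

    static String format(LocalTime t) {
        return MILLIS.format(t);
    }

    public static void main(String[] args) {
        // Hard-wired corner case from the report: milliseconds ending in zero.
        // Using a static time instead of LocalTime.now() makes the failure
        // deterministic rather than once-in-a-blue-moon.
        LocalTime t = LocalTime.of(20, 35, 18, 720_000_000);
        String s = format(t);
        if (!s.equals("20:35:18.720")) {
            throw new AssertionError("trailing zero dropped: " + s);
        }
        System.out.println(s); // 20:35:18.720
    }
}
```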
[jira] [Commented] (AVRO-2235) Regenerate TestRecordWithLogicalTypes
[ https://issues.apache.org/jira/browse/AVRO-2235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16637359#comment-16637359 ] Raymie Stata commented on AVRO-2235: I'm going to keep this open in case someone else wants to comment. But based on what Thiru says, I think I'm going to close this issue as "won't fix." > Regenerate TestRecordWithLogicalTypes > - > > Key: AVRO-2235 > URL: https://issues.apache.org/jira/browse/AVRO-2235 > Project: Avro > Issue Type: Bug > Components: logical types >Reporter: Raymie Stata >Assignee: Raymie Stata >Priority: Major > > TestRecordWithLogicalTypes.java is code that was generated by the specific > compiler and then moved into the testing code tree. It hasn't been changed > in a while, although the compiler is evolving. I tried to regenerate it and > found there is a problem with record_with_logical_types.avsc. I will fix the > schema file and then regenerate TestRecordWithLogicalTypes and check both in. > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (AVRO-2235) Regenerate TestRecordWithLogicalTypes
[ https://issues.apache.org/jira/browse/AVRO-2235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16636484#comment-16636484 ] Raymie Stata commented on AVRO-2235: I looked at the comment at the top of TestSpecificLogicalTypes and realized that maybe I shouldn't have updated TestRecordWithLogicalTypes. That comment reads: {quote}The classes [TestRecordWithLogicalTypes and TestRecordWithoutLogicalTypes] should not be re-generated because they test compatibility of Avro with existing Avro-generated sources. When using classes generated before AVRO-1684, logical types should not be applied by the read or write paths. Those files should behave as they did before.{quote} So it sounds like this code here is for testing backward compatibility and thus shouldn't be updated. At the same time, in my (GitHub) pull request for AVRO-2090, [~nkollar] suggests that I _do_ regenerate these classes. At this point, I'm not sure what the right thing to do is. Any suggestions? > Regenerate TestRecordWithLogicalTypes > - > > Key: AVRO-2235 > URL: https://issues.apache.org/jira/browse/AVRO-2235 > Project: Avro > Issue Type: Bug > Components: logical types >Reporter: Raymie Stata >Assignee: Raymie Stata >Priority: Major > > TestRecordWithLogicalTypes.java is code that was generated by the specific > compiler and then moved into the testing code tree. It hasn't been changed > in a while, although the compiler is evolving. I tried to regenerate it and > found there is a problem with record_with_logical_types.avsc. I will fix the > schema file and then regenerate TestRecordWithLogicalTypes and check both in. > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (AVRO-2235) Regenerate TestRecordWithLogicalTypes
[ https://issues.apache.org/jira/browse/AVRO-2235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16636466#comment-16636466 ] Raymie Stata commented on AVRO-2235: I want to re-regenerate TestRecordForLogicalTypes.java with my modifications to the specific compiler, but would prefer to do that after successfully re-generating it on the master. > Regenerate TestRecordWithLogicalTypes > - > > Key: AVRO-2235 > URL: https://issues.apache.org/jira/browse/AVRO-2235 > Project: Avro > Issue Type: Bug > Components: logical types >Reporter: Raymie Stata >Assignee: Raymie Stata >Priority: Major > > TestRecordWithLogicalTypes.java is code that was generated by the specific > compiler and then moved into the testing code tree. It hasn't been changed > in a while, although the compiler is evolving. I tried to regenerate it and > found there is a problem with record_with_logical_types.avsc. I will fix the > schema file and then regenerate TestRecordWithLogicalTypes and check both in. > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (AVRO-2235) Regenerate TestRecordWithLogicalTypes
[ https://issues.apache.org/jira/browse/AVRO-2235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16636464#comment-16636464 ] Raymie Stata commented on AVRO-2235: Does anyone know how TestRecordWithoutLogicalTypes.java was generated? Was it a hand-edit of the generated code for TestRecordWithLogicalTypes.java? > Regenerate TestRecordWithLogicalTypes > - > > Key: AVRO-2235 > URL: https://issues.apache.org/jira/browse/AVRO-2235 > Project: Avro > Issue Type: Bug > Components: logical types >Reporter: Raymie Stata >Assignee: Raymie Stata >Priority: Major > > TestRecordWithLogicalTypes.java is code that was generated by the specific > compiler and then moved into the testing code tree. It hasn't been changed > in a while, although the compiler is evolving. I tried to regenerate it and > found there is a problem with record_with_logical_types.avsc. I will fix the > schema file and then regenerate TestRecordWithLogicalTypes and check both in. > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (AVRO-2235) Regenerate TestRecordWithLogicalTypes
Raymie Stata created AVRO-2235: -- Summary: Regenerate TestRecordWithLogicalTypes Key: AVRO-2235 URL: https://issues.apache.org/jira/browse/AVRO-2235 Project: Avro Issue Type: Bug Components: logical types Reporter: Raymie Stata Assignee: Raymie Stata TestRecordWithLogicalTypes.java is code that was generated by the specific compiler and then moved into the testing code tree. It hasn't been changed in a while, although the compiler is evolving. I tried to regenerate it and found there is a problem with record_with_logical_types.avsc. I will fix the schema file and then regenerate TestRecordWithLogicalTypes and check both in. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (AVRO-2091) Eliminate org.apache.avro.specific.use_custom_coder feature flag
[ https://issues.apache.org/jira/browse/AVRO-2091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymie Stata updated AVRO-2091: --- Description: After the implementation of "custom coders" (AVRO-2090) is complete and has seen more production usage, this feature flag should be eliminated. (More specifically, the initial release of AVRO-2090 should set USE_CUSTOM_CODERS to false by default, to get some initial production testing. The release after that should set this flag to true by default, but allow folks to fall back on the old way in case there are corner cases that aren't working. The release after that should remove this feature flag altogether, under the assumption that it works just fine and there's no need to maintain two ways of doing things.) (was: After the implementation of "custom coders" (AVRO-2090) is complete and has seen more production usage, this feature flag should be eliminated.) > Eliminate org.apache.avro.specific.use_custom_coder feature flag > > > Key: AVRO-2091 > URL: https://issues.apache.org/jira/browse/AVRO-2091 > Project: Avro > Issue Type: Improvement > Components: java >Reporter: Raymie Stata >Priority: Minor > > After the implementation of "custom coders" (AVRO-2090) is complete and has seen > more production usage, this feature flag should be eliminated. (More > specifically, the initial release of AVRO-2090 should set USE_CUSTOM_CODERS > to false by default, to get some initial production testing. The release > after that should set this flag to true by default, but allow folks to fall > back on the old way in case there are corner cases that aren't working. The > release after that should remove this feature flag altogether, under the > assumption that it works just fine and there's no need to maintain two ways > of doing things.) -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (AVRO-2090) Improve encode/decode time for SpecificRecord using code generation
[ https://issues.apache.org/jira/browse/AVRO-2090?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymie Stata updated AVRO-2090: --- Attachment: customcoders.md Attaching a design document for (forthcoming) patch. > Improve encode/decode time for SpecificRecord using code generation > --- > > Key: AVRO-2090 > URL: https://issues.apache.org/jira/browse/AVRO-2090 > Project: Avro > Issue Type: Improvement > Components: java >Reporter: Raymie Stata > Attachments: customcoders.md > > > Compared to GenericRecords, SpecificRecords offer type-safety plus the > performance of traditional getters/setters/instance variables. But these are > only beneficial to Java code accessing those records. SpecificRecords > inherit serialization and deserialization code from GenericRecords, which is > dynamic and thus slow (in fact, benchmarks show that serialization and > deserialization is _slower_ for SpecificRecord than for GenericRecord). > This patch extends record.vm to generate custom, higher-performance encoder > and decoder functions for SpecificRecords. We've run a public benchmark > showing that the new code reduces serialization time by 2/3 and > deserialization time by close to 50%. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (AVRO-2094) Extend "custom coders" to support logical types
Raymie Stata created AVRO-2094: -- Summary: Extend "custom coders" to support logical types Key: AVRO-2094 URL: https://issues.apache.org/jira/browse/AVRO-2094 Project: Avro Issue Type: Improvement Reporter: Raymie Stata The initial implementation of "custom coders" (AVRO-2090) does not support Avro's logical types. This JIRA extends that implementation to remove this limitation. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (AVRO-2093) Extend "custom coders" to fully support union types
Raymie Stata created AVRO-2093: -- Summary: Extend "custom coders" to fully support union types Key: AVRO-2093 URL: https://issues.apache.org/jira/browse/AVRO-2093 Project: Avro Issue Type: Improvement Reporter: Raymie Stata The initial implementation of "custom coders" for SpecificRecord (AVRO-2090) only supports "nullable unions" (two-branch unions where one branch is the null type). This JIRA extends that implementation to support all forms of unions. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
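For context, the binary shape of the nullable union the initial implementation special-cases is simple: a zig-zag varint branch index, followed by the value only when the branch is non-null. A hand-rolled illustration of that shape (deliberately not Avro's own Encoder API; class and method names are invented):

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

public class NullableUnionSketch {
    // Zig-zag + base-128 varint, as in Avro's binary int encoding.
    static void writeInt(ByteArrayOutputStream out, int n) {
        long z = ((long) n << 1) ^ (n >> 31);   // zig-zag: small magnitudes stay small
        while ((z & ~0x7FL) != 0) {
            out.write((int) ((z & 0x7F) | 0x80)); // 7 data bits + continuation bit
            z >>>= 7;
        }
        out.write((int) z);
    }

    // A nullable union ["null","string"]: branch index first, then the
    // value only when the branch is non-null. This two-branch shape is
    // what the initial custom coders special-case; general unions need
    // per-branch dispatch over arbitrarily many types.
    static byte[] writeNullableString(String s) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        if (s == null) {
            writeInt(out, 0);                    // branch 0 = null, no payload
        } else {
            writeInt(out, 1);                    // branch 1 = string
            byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
            writeInt(out, utf8.length);          // length-prefixed UTF-8 bytes
            out.write(utf8, 0, utf8.length);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        System.out.println(writeNullableString(null).length);  // 1 byte: just the index
        System.out.println(writeNullableString("hi").length);  // 4 bytes: index, length, "hi"
    }
}
```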
[jira] [Created] (AVRO-2092) Flip default value of org.apache.avro.specific.use_custom_coder to true
Raymie Stata created AVRO-2092: -- Summary: Flip default value of org.apache.avro.specific.use_custom_coder to true Key: AVRO-2092 URL: https://issues.apache.org/jira/browse/AVRO-2092 Project: Avro Issue Type: Improvement Components: java Reporter: Raymie Stata Priority: Minor The initial implementation of "custom coders" for SpecificRecord is incomplete (it didn't initially handle logical types) and hasn't been battle-tested. Thus, it includes a feature flag (org.apache.avro.specific.use_custom_coder) to toggle between the new code and the old code. The initial default for this feature flag is false -- defaulting to the old code -- but when the implementation of custom coders for SpecificRecord is completed and has seen more production use, we should switch the default to true, on the way to eliminating the flag altogether (AVRO-2091). -- This message was sent by Atlassian JIRA (v6.4.14#64029)
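A minimal sketch of the kind of feature-flag check described above; the property name comes from the issue, while the class and method are hypothetical:

```java
public class CustomCoderFlag {
    // Reads the system property named in AVRO-2092. Defaulting to
    // "false" matches the staged rollout: old code path unless a user
    // opts in with -Dorg.apache.avro.specific.use_custom_coder=true.
    static boolean useCustomCoders() {
        return Boolean.parseBoolean(
            System.getProperty("org.apache.avro.specific.use_custom_coder", "false"));
    }

    public static void main(String[] args) {
        System.out.println(useCustomCoders()); // false unless the property is set
    }
}
```

Flipping the default in a later release is then a one-character change to the second argument, and removing the flag (AVRO-2091) deletes the check entirely.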
[jira] [Created] (AVRO-2091) Eliminate org.apache.avro.specific.use_custom_coder feature flag
Raymie Stata created AVRO-2091: -- Summary: Eliminate org.apache.avro.specific.use_custom_coder feature flag Key: AVRO-2091 URL: https://issues.apache.org/jira/browse/AVRO-2091 Project: Avro Issue Type: Improvement Components: java Reporter: Raymie Stata Priority: Minor After the implementation of "custom coders" (AVRO-2090) is complete and has seen more production usage, this feature flag should be eliminated. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (AVRO-2090) Improve encode/decode time for SpecificRecord using code generation
Raymie Stata created AVRO-2090: -- Summary: Improve encode/decode time for SpecificRecord using code generation Key: AVRO-2090 URL: https://issues.apache.org/jira/browse/AVRO-2090 Project: Avro Issue Type: Improvement Components: java Reporter: Raymie Stata Compared to GenericRecords, SpecificRecords offer type-safety plus the performance of traditional getters/setters/instance variables. But these are only beneficial to Java code accessing those records. SpecificRecords inherit serialization and deserialization code from GenericRecords, which is dynamic and thus slow (in fact, benchmarks show that serialization and deserialization is _slower_ for SpecificRecord than for GenericRecord). This patch extends record.vm to generate custom, higher-performance encoder and decoder functions for SpecificRecords. We've run a public benchmark showing that the new code reduces serialization time by 2/3 and deserialization time by close to 50%. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (AVRO-806) add a column-major codec for data files
[ https://issues.apache.org/jira/browse/AVRO-806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymie Stata updated AVRO-806: -- In about a month we will have some Hive benchmarks, but the data won't be very wide, so they won't be good for testing column-major formats. However, maybe we should walk before we run: If someone puts Avro SerDe's in place against the regular Avro format, we could benchmark and maybe even help tune that configuration, which would provide a baseline for testing a column-major configuration. (Unfortunately, we can't do the SerDe work itself.) add a column-major codec for data files --- Key: AVRO-806 URL: https://issues.apache.org/jira/browse/AVRO-806 Project: Avro Issue Type: New Feature Components: java, spec Reporter: Doug Cutting Assignee: Doug Cutting Fix For: 1.7.0 Attachments: AVRO-806-v2.patch, AVRO-806.patch, avro-file-columnar.pdf Define a codec that, when a data file's schema is a record schema, writes blocks within the file in column-major order. This would permit better compression and also permit efficient skipping of fields that are not of interest. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (AVRO-806) add a column-major codec for data files
[ https://issues.apache.org/jira/browse/AVRO-806?page=com.atlassian.jira.plugin.system.issuetabpanelfocusedCommentId=13260732#comment-13260732 ] Raymie Stata commented on AVRO-806: --- This is the second attempt at a column-major codec. The whole goal of col-major formats is to optimize performance. Thus, to drive this exercise forward it seems necessary to have some kind of benchmark to do some testing. (I don't think a micro-benchmark is sufficient -- rather the right benchmark is with a query planner (Hive?) that can take advantage of these formats.) With such a benchmark in place, we'd compare the performance of the existing row-major Avro format (as a baseline) with the various proposed col-major formats to make sure that we're getting the kind of performance improvements (2x, 4x or more) to justify the complexity of a col-major format. Some comments more specific to this proposal: First, I'd like to see the Type Mapping section for Avro filled in; this would give us a much better idea of what you're trying to do. Second, at first glance, it seems like your design replicates some of the features of RCFiles that the CIF paper claims cause performance problems (but, again, maybe this issue is better addressed via some benchmarking). Regarding your implementation of this proposal, it re-implements all the lower levels of Avro. It seems like this double implementation will be a maintenance problem. add a column-major codec for data files --- Key: AVRO-806 URL: https://issues.apache.org/jira/browse/AVRO-806 Project: Avro Issue Type: New Feature Components: java, spec Reporter: Doug Cutting Assignee: Doug Cutting Fix For: 1.7.0 Attachments: AVRO-806-v2.patch, AVRO-806.patch, avro-file-columnar.pdf Define a codec that, when a data file's schema is a record schema, writes blocks within the file in column-major order. This would permit better compression and also permit efficient skipping of fields that are not of interest. 
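For readers unfamiliar with the layouts being compared: a column-major codec regroups each block's values per field, so a scan touching one field can skip the rest, and similar values sit together for compression. A toy illustration of the regrouping step, with invented record and field names (no file-format details):

```java
import java.util.ArrayList;
import java.util.List;

public class ColumnMajorSketch {
    record Row(long id, String name) {}

    // Row-major storage interleaves fields record by record.
    // Column-major storage keeps one array per field, which is what
    // enables per-column compression and skipping unread fields.
    static List<List<Object>> toColumns(List<Row> block) {
        List<Object> ids = new ArrayList<>();
        List<Object> names = new ArrayList<>();
        for (Row r : block) {   // one pass over the block, regrouping by field
            ids.add(r.id());
            names.add(r.name());
        }
        return List.of(ids, names);
    }

    public static void main(String[] args) {
        List<List<Object>> cols = toColumns(
            List.of(new Row(1, "a"), new Row(2, "b")));
        // A query needing only ids reads cols.get(0) and skips the names entirely.
        System.out.println(cols.get(0)); // [1, 2]
    }
}
```

In a real codec the per-column arrays would then be individually compressed and indexed within the file block; the benchmarking question raised above is whether that buys enough over row-major blocks to justify the format complexity.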
-- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira