Hi John.
Sorry for the late reply; I was off work ill for a few days.

Reading the schema from the file and then processing the data without knowing the structure beforehand is the main use case for GenericDatum.

The inefficiencies relate to the way that GenericDatum handles arrays. In our use case, most of the data consists of large floating-point arrays, or multi-dimensional (nested) arrays. GenericDatum stores these in an STL vector. So where a C++ structure might hold a float[1000], with GenericDatum you get a vector<GenericDatum> in which each GenericDatum contains a single float. Of course this is the most flexible way of implementing it, and it works, but it uses considerably more memory. Profiling a read of one of our data structures showed that more than 50% of the time was spent in malloc()/free()! If your data doesn't have lots of large numeric arrays, then GenericDatum should work reasonably well for you.

As for the compression: as Doug has already answered, C++ only supports the null (no compression) codec and deflate. You can always use the Avro Java tools 'recodec' command to convert from an unsupported codec to deflate if you need to. The groundwork for codec support in C++ is already there, so it should be quite easy to add additional codecs now. Most of the work would be in getting the makefile/library stuff right.

Regards,
Steve Roehrs
Senior Software Engineer | Lockheed Martin | p: +61 8 7389 4525 | m: +61 4 3891 5622 | f: +61 8 7389 4551 | w: www.rlmgroup.com.au | e: steve.roe...@rlmgroup.com.au | Company address: 82-86 Woomera Ave, Edinburgh, SA 5111

This email and any attachment to it remains the property of Lockheed Martin and is intended only to be read or used by the named addressee. It may contain information that is confidential, commercially valuable or subject to legal privilege. If you receive this email in error, please immediately delete it and notify the sender.
Opinions, conclusions and other information in this message that do not relate to the official business of Lockheed Martin or any companies within Lockheed Martin shall be understood as neither given nor endorsed by them.

________________________________
From: John Lilley [mailto:john.lil...@redpoint.net]
Sent: Friday, August 15, 2014 1:16 AM
To: user@avro.apache.org; Steve Roehrs
Subject: RE: State of the C++ vs Java implementations

Steve,

Thanks so much for the reply. I hope that I can inconvenience you for a little more guidance.

We want to read and write Avro data files whose schema is not known until run-time, when we read the file metadata and transform that into our own internal record structure. So we are not mapping to a C++ struct/class with defined compile-time members. We just want to loop over the records and columns in the data file, transforming them serially. Can this be done without incurring the performance penalty of GenericDatum that you speak of?

Different question: do you know if the full complement of compression codecs is available in C++? We don't need "everything possible", but we want to be able to read 99.9% of files that we are likely to encounter in practice.

Thanks
John

From: Steve Roehrs [mailto:steve.roe...@rlmgroup.com.au]
Sent: Sunday, August 10, 2014 11:25 PM
To: user@avro.apache.org
Subject: RE: State of the C++ vs Java implementations

Hi John

You can definitely read and write Avro data files using C++. The DataFileWriter and DataFileReader classes are what you need. The README is severely out of date.

I can't comment on the relative performance of the Java/C++ APIs. We used the C++ API for our application, but for performance reasons we don't use the GenericDatum class, as it does have poor performance for our particular mix of data. I don't know if the Java API fares any better in this regard.
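
As a rough illustration of the array layout described at the top of this thread, the sketch below contrasts a one-object-per-element container with a flat float buffer. It is only a model under stated assumptions: BoxedValue, make_boxed, make_flat, sum_boxed, sum_flat and the g_box_allocations counter are hypothetical names invented for this example and are not part of the Avro C++ API, and GenericDatum's real internals may differ from this stand-in.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Counts the per-element heap allocations made by the boxed layout.
static std::size_t g_box_allocations = 0;

// Hypothetical stand-in for a generic, dynamically typed value.
// This is NOT Avro's GenericDatum; it only mimics the assumed
// "one heap object per array element" cost.
struct BoxedValue {
    std::unique_ptr<float> payload;  // each value owns its own allocation

    explicit BoxedValue(float v) : payload(new float(v)) {
        ++g_box_allocations;
    }

    float get() const {
        assert(payload);  // a moved-from box must not be read
        return *payload;
    }
};

// Generic layout: n elements, each boxed separately.
std::vector<BoxedValue> make_boxed(std::size_t n) {
    std::vector<BoxedValue> out;
    out.reserve(n);  // one allocation for the vector's own buffer
    for (std::size_t i = 0; i < n; ++i) {
        out.emplace_back(static_cast<float>(i));  // plus one per element
    }
    return out;
}

// Flat layout: a single contiguous buffer of floats.
std::vector<float> make_flat(std::size_t n) {
    std::vector<float> out(n);
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = static_cast<float>(i);
    }
    return out;
}

// Summing the boxed layout follows one pointer per element.
double sum_boxed(const std::vector<BoxedValue>& values) {
    double total = 0.0;
    for (const BoxedValue& v : values) {
        total += v.get();
    }
    return total;
}

// Summing the flat layout walks contiguous memory.
double sum_flat(const std::vector<float>& values) {
    double total = 0.0;
    for (float v : values) {
        total += v;
    }
    return total;
}
```

In this model, make_boxed(1000) performs 1000 per-element allocations on top of the vector's own buffer, while make_flat(1000) needs only the single buffer. Both layouts hold the same values, so the difference is purely in allocation count and memory locality.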
Regards,
Steve Roehrs

________________________________
From: John Lilley [mailto:john.lil...@redpoint.net]
Sent: Wednesday, August 06, 2014 6:28 AM
To: user@avro.apache.org
Subject: State of the C++ vs Java implementations

Greetings,

I would like to read and write Avro files (such as those manipulated by MapReduce applications) from a C++ program. While there are higher-level wrappers (such as Hive), I am interested in reading/writing the files directly. There are both C++ and Java library implementations; however, in the C++ API README I see "And the file and rpc containers are not yet implemented." Does this mean that I can't read and write Avro files using the C++ library?

We have a very good C++/JNI wrapper-generator, so using the Java library is not terribly difficult. Given that, which interface would you recommend? Does the C++ interface (assuming it works) have significant performance advantages?

Thanks
john