Re: C++ DataFileWriter not doing validation
From: SCHENK, Jarrad
Sent: Tuesday, 21 May 2013 12:52 PM
To: user@avro.apache.org
Subject: C++ DataFileWriter not doing validation

Hi List,

I'm working with the C++ bindings to try to write data to Avro files. Much of the documentation assumes that the types to be written (and the code to write the data) are generated using avrogencpp. In my case I have an existing set of type/struct hierarchies that I'm trying to write, so I don't want to use the output of avrogencpp directly. Instead I am producing code that is very similar to what avrogencpp outputs but is adapted to suit my types.

What I'm finding is that the C++ DataFileWriter does no validation between the schema that I provide and the datums that get written. As a result, any discrepancy between the schema and the written datums corrupts the file and leaves it essentially unreadable.

I see that there is a ValidatingEncoder class that can be used when serialising to a memory stream (as per the Getting Started docs), but there doesn't appear to be any way to use this encoder with the DataFileWriter. Am I missing something? Is there a preferred way to make the writer do validation?

Thanks,
Jarrad

I never got any responses to this, so I assume either no one knows, or no one actively uses the C++ bindings any more. My basic question boils down to: can the DataFileWriter be made to do validation against the schema? Any help much appreciated.

Jarrad
Re: Is Avro right for me?
What's more, there are examples and support for Kafka, but not so much for Flume.

On Mon, May 27, 2013 at 6:25 AM, Martin Kleppmann mar...@rapportive.com wrote:

I don't have experience with Flume, so I can't comment on that. At LinkedIn we ship logs around by sending Avro-encoded messages to Kafka (http://kafka.apache.org/). Kafka is nice: it scales very well and gives a great deal of flexibility — logs can be consumed by any number of independent consumers, consumers can catch up on a backlog if they're disconnected for a while, and it comes with Hadoop import out of the box. (RabbitMQ is designed more for use cases where each message corresponds to a task that needs to be performed by a worker. IMHO Kafka is a better fit for logs, which are more stream-like.)

With any message broker, you'll need to somehow tag each message with the schema that was used to encode it. You could include the full schema with every message, but unless you have very large messages, that would be a huge overhead. Better to give each version of your schema a sequential version number, or hash the schema, and include the version number/hash in each message. You can then keep a repository of schemas for resolving those version numbers or hashes – either simply in files that you distribute to all producers/consumers, or in a simple REST service like https://issues.apache.org/jira/browse/AVRO-1124

Hope that helps,
Martin

On 26 May 2013 17:39, Mark static.void@gmail.com wrote:

Yes, our central server would be Hadoop. Exactly how would this work with Flume? Would I write Avro to a file source which Flume would then ship over to one of our collectors, or is there a better/native way? Would I have to include the schema in each event? FYI, we would be doing this primarily from a Rails application. Does anyone ever use Avro with a message bus like RabbitMQ?

On May 23, 2013, at 9:16 PM, Sean Busbey bus...@cloudera.com wrote:

Yep.
Avro would be great at that (provided your central consumer is Avro-friendly, like a Hadoop system). Make sure that all of your schemas have default values defined for fields so that schema evolution will be easier in the future.

On Thu, May 23, 2013 at 4:29 PM, Mark static.void@gmail.com wrote:

We're thinking about generating logs and events with Avro and shipping them to a central collector service via Flume. Is this a valid use case?

-- Sean

-- Russell Jurney twitter.com/rjurney russell.jur...@gmail.com datasyndrome.com
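Martin's version-number/hash scheme above is easy to sketch. The following is a toy illustration, not the avro library's own framing: the schema, registry, and 8-byte truncated SHA-256 fingerprint are all invented for the example (the Avro spec itself defines a CRC-64-AVRO fingerprint over a schema's Parsing Canonical Form, but any stable hash works for a private registry).

```python
import hashlib
import json

# A hypothetical v1 schema -- invented for this example.
SCHEMA_V1 = json.dumps({
    "type": "record", "name": "LogEvent",
    "fields": [{"name": "message", "type": "string"}],
}, sort_keys=True)

def fingerprint(schema_json: str) -> bytes:
    """8-byte schema fingerprint (truncated SHA-256, for illustration only)."""
    return hashlib.sha256(schema_json.encode("utf-8")).digest()[:8]

# Schema repository shared out of band by all producers and consumers.
REGISTRY = {fingerprint(SCHEMA_V1): SCHEMA_V1}

def frame(payload: bytes, schema_json: str) -> bytes:
    """Producer side: prefix each message with its schema's fingerprint."""
    return fingerprint(schema_json) + payload

def unframe(message: bytes) -> tuple[str, bytes]:
    """Consumer side: look up the writer's schema, return it with the payload."""
    fp, payload = message[:8], message[8:]
    return REGISTRY[fp], payload

msg = frame(b"<avro-encoded bytes>", SCHEMA_V1)
schema, payload = unframe(msg)
assert schema == SCHEMA_V1 and payload == b"<avro-encoded bytes>"
```

The per-message overhead is then a constant 8 bytes instead of the full schema text, and consumers can decode backlogged messages written under any schema version still present in the registry.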
Re: Is Avro right for me?
Mark: FWIW, I would go with Kafka if you can; it's far more flexible. We aren't using it until it authenticates producers and consumers and provides a way to encrypt transport (we run in the cloud), so we're using Flume.

For Flume, with the current out-of-the-box implementation, they encapsulate data in an Avro event themselves. So it's up to you what you put into the body of that Avro event. It could just be JSON, or it could be your own serialized Avro event — and as far as I understand, serialized Avro always has the schema with it (right?).

Be aware that Flume doesn't have great support for languages outside the JVM. Flume's Avro source that you can communicate with via Avro RPC uses NettyServer/NettyTransceiver underneath, and as far as I know, there have been no updates to other Avro RPC libraries (e.g. Python, Ruby) that enable talking to such an Avro RPC endpoint. So you either have to build a client that speaks that protocol, or create your own source.

Cheers,
Stefan

On Mon, May 27, 2013 at 11:08 AM, Russell Jurney russell.jur...@gmail.com wrote:

What's more, there are examples and support for Kafka, but not so much for Flume. [remainder of quoted thread snipped; see the previous message]
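Sean's advice about defaults pays off at read time: under Avro's schema-resolution rules, when the reader's schema has a field that the writer's data lacks, the field's declared default is filled in rather than the read failing. Here is a hand-rolled toy sketch of that one rule (not the avro library; both schemas are invented for the example):

```python
# Hypothetical v1 (writer) and v2 (reader) schemas -- invented for the example.
WRITER_SCHEMA = {
    "type": "record", "name": "LogEvent",
    "fields": [{"name": "message", "type": "string"}],
}
READER_SCHEMA = {
    "type": "record", "name": "LogEvent",
    "fields": [
        {"name": "message", "type": "string"},
        # New in v2: readable against v1 data only because of the default.
        {"name": "severity", "type": "string", "default": "INFO"},
    ],
}

def resolve(record: dict, reader_schema: dict) -> dict:
    """Toy version of Avro schema resolution: a field missing from the
    record takes its declared default; no value and no default is an error."""
    out = {}
    for field in reader_schema["fields"]:
        name = field["name"]
        if name in record:
            out[name] = record[name]
        elif "default" in field:
            out[name] = field["default"]
        else:
            raise ValueError(f"no value and no default for field {name!r}")
    return out

old_record = {"message": "disk full"}      # written under v1
print(resolve(old_record, READER_SCHEMA))  # {'message': 'disk full', 'severity': 'INFO'}
```

A field added without a default would make every old record unreadable under the new schema, which is why declaring defaults up front makes later evolution painless.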