Re: Is Avro right for me?
Also, if you end up choosing to use Kafka and persisting your messages into Hadoop, then you should take a look at Camus (https://github.com/linkedin/camus), which is also from LinkedIn. If you do things the LinkedIn way right from the start (i.e. using the AVRO-1124 schema repo and encoding time in a standard way in a header contained in all your schemas), then you can use Camus pretty much out of the box without any tweaking, and the solution you'll get is very flexible / extensible (regarding the ability to evolve your schemas gracefully, letting Camus discover new topics to persist automatically, etc.). For us it was a little more complicated since we had some legacy stuff that was not exactly how Camus expected it, but it wasn't that complicated to integrate either...

--
Felix

On Thu, Jun 6, 2013 at 2:51 PM, Felix GV fe...@mate1inc.com wrote:

You can serialize Avro messages into JSON or binary format, and then pass them around to anything else (send them over HTTP, publish them to a message broker system like Kafka or Flume, write them directly into a data store, etc.). You can forget about Avro RPC, as it's just one way amongst many of doing this.

You do need to manage schemas properly, though. The easy way is to hardcode your schema on both ends, but that makes it harder to evolve schemas (which Avro can otherwise do very well). If you send single serialized Avro messages around through a message broker system, then you should definitely consider putting a version number for your schema at the beginning of the message, as Martin suggested. Then you can look up what schema each version number represents with something like the versioned schema repo in AVRO-1124 (https://issues.apache.org/jira/browse/AVRO-1124).

--
Felix

On Tue, Jun 4, 2013 at 11:10 PM, Mark static.void@gmail.com wrote:

I have a question. Say I want to use Avro as my serialization format to speak between service applications.
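Felix's suggestion of putting a schema version number at the front of each message can be sketched roughly as follows (a minimal illustration using only Python's standard library; the magic byte, the 4-byte big-endian version field, and the framing layout are assumptions for illustration, not something defined by AVRO-1124):

```python
import struct

MAGIC = 0x01  # assumed marker byte identifying this framing format


def frame(schema_version: int, payload: bytes) -> bytes:
    """Prefix a serialized Avro record with a magic byte and schema version."""
    return struct.pack(">BI", MAGIC, schema_version) + payload


def unframe(message: bytes) -> tuple:
    """Split a framed message back into (schema_version, payload)."""
    magic, version = struct.unpack(">BI", message[:5])
    if magic != MAGIC:
        raise ValueError("not a framed Avro message")
    return version, message[5:]


# A consumer would resolve the version against a schema repository
# (e.g. the AVRO-1124 REST service) before decoding the payload.
msg = frame(7, b"\x02\x06foo")  # stand-in for a binary-encoded Avro record
version, payload = unframe(msg)
print(version, payload)  # -> 7 b'\x02\x06foo'
```

The 5-byte prefix is tiny compared to shipping a full schema, which is the whole point of the version-number approach.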
Do I need to use Avro RPC for this, or can I just exchange Avro messages over HTTP? Also, what's the difference between an IPC client and an HTTP IPC client? https://github.com/apache/avro/tree/trunk/lang/ruby/test

Thanks

On May 29, 2013, at 8:02 PM, Mike Percy mpe...@apache.org wrote:

There is no Ruby support for the Netty Avro RPC protocol that I know of. But I'm not sure why that matters, other than the fact that the Flume Thrift support isn't in an official release yet. You could also take a look at the Flume HTTP source for a REST-based interface, but to accept binary data instead of JSON (the default) you would need to write a small bit of Java code and plug that in.

Make sure you differentiate between using Avro as a data storage format and as an RPC mechanism. They are two very different things and don't need to be tied together. Today, the data storage aspect is more mature and has much wider language support.

Mike

On Wed, May 29, 2013 at 9:30 AM, Mark static.void@gmail.com wrote:

So basically Avro RPC is out of the question? Instead I would need to go Avro Message -> Thrift -> Flume? Is that along the right lines or am I missing something?

On May 28, 2013, at 5:02 PM, Mike Percy mpe...@apache.org wrote:

Regarding Ruby support, we recently added support for Thrift RPC, so you can now send messages to Flume via Ruby and other non-JVM languages. We don't have out-of-the-box client APIs for those yet but would be happy to accept patches for it :)
Re: Is Avro right for me?
Flume is actually working on what you might call first-class Avro support right now, but you can already use it today, and there are people doing so in production with success.

First of all, I assume that you want to store binary-encoded Avro in each event. As mentioned previously in this thread, this implies that the schema needs to come from somewhere. Right now, with the released version of Flume (1.3.1), you would want to write your own EventSerializer (http://flume.apache.org/FlumeUserGuide.html#event-serializers) for each schema you need to write to HDFS. There is a base class (http://flume.apache.org/releases/content/1.3.1/apidocs/org/apache/flume/serialization/AbstractAvroEventSerializer.html) you can subclass that makes it easier to serialize Avro at that level.

There is a bunch of new development underway to make this a lot easier to deal with:

1. Something to parse Avro container files and send them to Flume: https://issues.apache.org/jira/browse/FLUME-2048
2. A generic event serializer that keys off a hash in the event header to determine the schema: https://issues.apache.org/jira/browse/FLUME-2010

Regarding Ruby support, we recently added support for Thrift RPC, so you can now send messages to Flume via Ruby and other non-JVM languages. We don't have out-of-the-box client APIs for those yet but would be happy to accept patches for it :)

Feel free to reach out to d...@flume.apache.org or u...@flume.apache.org if you'd like more information or want to help get these features finalized sooner!

Mike

On Tue, May 28, 2013 at 3:38 PM, Mark static.void@gmail.com wrote:

Thanks for all of the information. I actually looked into Kafka quite some time ago and I think we passed on it because it didn't have much Ruby support (that may have changed by now).

On May 27, 2013, at 12:34 PM, Martin Kleppmann mar...@rapportive.com wrote:

On 27 May 2013 20:00, Stefan Krawczyk ste...@nextdoor.com wrote:

So it's up to you what you stick into the body of that Avro event.
It could just be JSON, or it could be your own serialized Avro event, and as far as I understand, serialized Avro always has the schema with it (right?).

In an Avro data file, yes, because you just need to specify the schema once, followed by (say) a million records that all use the same schema. And in an RPC context, you can negotiate the schema once per connection. But when using a message broker, you're serializing individual records and don't have an end-to-end connection with the consumer, so you'd need to include the schema with every single message. It probably doesn't make sense to include the full schema with every one, as a typical schema might be 2 kB whereas a serialized record might be less than 100 bytes (numbers obviously vary wildly by application), so the schema size would dominate. Hence my suggestion of including a schema version number or hash with every message.

Be aware that Flume doesn't have great support for languages outside of the JVM.

The same caveat unfortunately applies to Kafka too. There are clients for non-JVM languages, but they lack important features, so I would recommend using the official JVM client (if your application is non-JVM, you could simply pipe your application's stdout into the Kafka producer, or vice versa on the consumer side).

Martin
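Martin's hash-based tagging can be sketched like this (a minimal illustration; using MD5 of the schema's JSON text with sorted keys is a stand-in normalization invented here, though the Avro specification does define a "Parsing Canonical Form" with standard fingerprints for exactly this purpose):

```python
import hashlib
import json


def schema_fingerprint(schema_json: str) -> str:
    """Hash a schema so each message can carry a short, stable identifier.

    Re-serializing the parsed JSON with sorted keys and no whitespace is a
    simple stand-in for proper canonicalization: two textual variants of
    the same schema map to the same fingerprint.
    """
    normalized = json.dumps(
        json.loads(schema_json), sort_keys=True, separators=(",", ":")
    )
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()


user_schema = (
    '{"type": "record", "name": "User",'
    ' "fields": [{"name": "id", "type": "long"}]}'
)
fp = schema_fingerprint(user_schema)
# A 16-byte fingerprint rides in every message header instead of a ~2 kB schema.
print(fp, len(bytes.fromhex(fp)))
```

This captures the trade-off Martin describes: the per-message overhead drops from the full schema text to a fixed-size hash, at the cost of needing a repository to resolve the hash back to a schema.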
Re: Is Avro right for me?
What's more, there are examples and support for Kafka, but not so much for Flume.

On Mon, May 27, 2013 at 6:25 AM, Martin Kleppmann mar...@rapportive.com wrote:

I don't have experience with Flume, so I can't comment on that. At LinkedIn we ship logs around by sending Avro-encoded messages to Kafka (http://kafka.apache.org/). Kafka is nice, it scales very well and gives a great deal of flexibility: logs can be consumed by any number of independent consumers, consumers can catch up on a backlog if they're disconnected for a while, and it comes with Hadoop import out of the box. (RabbitMQ is designed more for use cases where each message corresponds to a task that needs to be performed by a worker. IMHO Kafka is a better fit for logs, which are more stream-like.)

With any message broker, you'll need to somehow tag each message with the schema that was used to encode it. You could include the full schema with every message, but unless you have very large messages, that would be a huge overhead. Better to give each version of your schema a sequential version number, or hash the schema, and include the version number/hash in each message. You can then keep a repository of schemas for resolving those version numbers or hashes: simply in files that you distribute to all producers/consumers, or in a simple REST service like https://issues.apache.org/jira/browse/AVRO-1124

Hope that helps,
Martin

On 26 May 2013 17:39, Mark static.void@gmail.com wrote:

Yes, our central server would be Hadoop. Exactly how would this work with Flume? Would I write Avro to a file source which Flume would then ship over to one of our collectors, or is there a better/native way? Would I have to include the schema in each event? FYI, we would be doing this primarily from a Rails application. Does anyone ever use Avro with a message bus like RabbitMQ?

On May 23, 2013, at 9:16 PM, Sean Busbey bus...@cloudera.com wrote:

Yep.
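The schema repository Martin describes can be as simple as a shared map from version number to schema text (a toy in-memory sketch; the class and method names here are made up for illustration, and AVRO-1124 defines its own REST API rather than this interface):

```python
class SchemaRegistry:
    """Toy in-memory schema repository: version number -> schema text.

    In production this would be backed by files distributed to all
    producers/consumers, or by a REST service (as in AVRO-1124), so
    that both sides resolve the same version numbers identically.
    """

    def __init__(self):
        self._schemas = {}
        self._next_version = 1

    def register(self, schema_json: str) -> int:
        """Store a schema; the returned version number is embedded in messages."""
        version = self._next_version
        self._schemas[version] = schema_json
        self._next_version += 1
        return version

    def lookup(self, version: int) -> str:
        """Resolve a version number from a message header back to its schema."""
        return self._schemas[version]


registry = SchemaRegistry()
v1 = registry.register('{"type": "record", "name": "LogEvent", "fields": []}')
# A consumer seeing v1 in a message header fetches the writer's schema here,
# then combines it with its own reader schema for Avro schema resolution.
print(v1, registry.lookup(v1))
```

The same shape works with schema hashes as keys instead of sequential version numbers; sequential numbers are smaller, hashes avoid needing a central counter.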
Avro would be great at that (provided your central consumer is Avro-friendly, like a Hadoop system). Make sure that all of your schemas have default values defined for fields so that schema evolution will be easier in the future.

On Thu, May 23, 2013 at 4:29 PM, Mark static.void@gmail.com wrote:

We're thinking about generating logs and events with Avro and shipping them to a central collector service via Flume. Is this a valid use case?

--
Sean

--
Russell Jurney
twitter.com/rjurney
russell.jur...@gmail.com
datasyndrome.com
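Sean's advice about default values looks like this in practice (a hypothetical schema; the LogEvent record and its fields are made up for illustration). Fields added in a later schema version carry a default, so records written before the field existed can still be read:

```json
{
  "type": "record",
  "name": "LogEvent",
  "namespace": "com.example.logs",
  "fields": [
    {"name": "timestamp", "type": "long"},
    {"name": "message", "type": "string"},
    {"name": "severity", "type": "string", "default": "INFO"},
    {"name": "host", "type": ["null", "string"], "default": null}
  ]
}
```

When a reader using this schema decodes an older record that lacks "severity", Avro's schema resolution supplies "INFO" instead of failing. Note that for a union field the default must match the union's first branch, which is why the nullable "host" lists "null" first.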
Re: Is Avro right for me?
Mark: FWIW, I would go with Kafka if you can; it's far more flexible. We aren't using it until it authenticates producers and consumers and provides a way to encrypt transport (we run in the cloud)... Anyway, so we're using Flume.

For Flume, with the current out-of-the-box implementation, they encapsulate data in an Avro event themselves. So it's up to you what you stick into the body of that Avro event. It could just be JSON, or it could be your own serialized Avro event, and as far as I understand, serialized Avro always has the schema with it (right?).

Be aware that Flume doesn't have great support for languages outside of the JVM. Flume's Avro source that you can communicate with via Avro RPC uses NettyServer/NettyTransceiver underneath, and as far as I know, there have been no updates to other Avro RPC libraries (e.g. Python, Ruby) that enable talking to such an Avro RPC endpoint. So you either have to build a client that speaks that, or create your own source.

Cheers,
Stefan

On Mon, May 27, 2013 at 11:08 AM, Russell Jurney russell.jur...@gmail.com wrote:

What's more, there are examples and support for Kafka, but not so much for Flume.
Re: Is Avro right for me?
Yes, our central server would be Hadoop. Exactly how would this work with Flume? Would I write Avro to a file source which Flume would then ship over to one of our collectors, or is there a better/native way? Would I have to include the schema in each event? FYI, we would be doing this primarily from a Rails application. Does anyone ever use Avro with a message bus like RabbitMQ?

On May 23, 2013, at 9:16 PM, Sean Busbey bus...@cloudera.com wrote:

Yep. Avro would be great at that (provided your central consumer is Avro-friendly, like a Hadoop system). Make sure that all of your schemas have default values defined for fields so that schema evolution will be easier in the future.

On Thu, May 23, 2013 at 4:29 PM, Mark static.void@gmail.com wrote:

We're thinking about generating logs and events with Avro and shipping them to a central collector service via Flume. Is this a valid use case?

--
Sean