Re: Is Avro right for me?

2013-06-06 Thread Felix GV
Also, if you end up choosing to use Kafka and persisting your messages into
Hadoop, then you should take a look at Camus
(https://github.com/linkedin/camus), which is also from LinkedIn.

If you do things the LinkedIn way right from the start (i.e.: using the
AVRO-1124 schema repo and encoding time in a standard way in a header
contained in all your schemas), then you can use Camus pretty much out of
the box without any tweaking, and the solution you'll get is very flexible
/ extensible (regarding the ability to evolve your schemas gracefully,
letting Camus discover new topics to persist automatically, etc.).
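
To give an idea of the kind of header that implies (the record and field
names below are made up for illustration, not the actual LinkedIn layout),
every schema embeds a common header record carrying the event time:

{
  "type": "record",
  "name": "PageViewEvent",
  "fields": [
    { "name": "header", "type": {
        "type": "record",
        "name": "EventHeader",
        "fields": [
          { "name": "time", "type": "long",
            "doc": "event time in epoch millis (hypothetical field name)" }
        ]
      }
    },
    { "name": "page", "type": "string" }
  ]
}

Because that header appears in every schema, Camus-style tooling can pull
the timestamp out of any message without knowing anything else about it.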

For us it was a little more involved, since we had some legacy stuff that
was not exactly in the shape Camus expected, but it wasn't that hard to
integrate either...

--
Felix


On Thu, Jun 6, 2013 at 2:51 PM, Felix GV fe...@mate1inc.com wrote:

 You can serialize Avro messages into JSON or binary format, and then pass
 them around to anything else (send them over HTTP, publish them to a
 message broker system like Kafka or Flume, write them directly into a data
 store, etc.). You can forget about Avro RPC, as it's just one way amongst
 many of doing this.
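
 For instance, a minimal sketch of the binary route with the Java generic
 API (Avro 1.7-era; nothing here is HTTP- or Kafka-specific):

 import java.io.ByteArrayOutputStream;
 import org.apache.avro.Schema;
 import org.apache.avro.generic.GenericDatumReader;
 import org.apache.avro.generic.GenericDatumWriter;
 import org.apache.avro.generic.GenericRecord;
 import org.apache.avro.io.BinaryDecoder;
 import org.apache.avro.io.BinaryEncoder;
 import org.apache.avro.io.DecoderFactory;
 import org.apache.avro.io.EncoderFactory;

 public class AvroBinaryExample {
   // Encode a record to raw Avro bytes that you can ship anywhere.
   public static byte[] encode(Schema schema, GenericRecord record) throws Exception {
     ByteArrayOutputStream out = new ByteArrayOutputStream();
     BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
     new GenericDatumWriter<GenericRecord>(schema).write(record, encoder);
     encoder.flush();
     return out.toByteArray();
   }

   // Decode them back, given the schema they were written with.
   public static GenericRecord decode(Schema schema, byte[] bytes) throws Exception {
     BinaryDecoder decoder = DecoderFactory.get().binaryDecoder(bytes, null);
     return new GenericDatumReader<GenericRecord>(schema).read(null, decoder);
   }
 }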

 You do need to manage schemas properly though. The easy way is to hardcode
 your schema on both ends, but that makes it harder to evolve schemas
 (which Avro can otherwise do very well). If you send single serialized
 Avro messages around through a message broker system, then you should
 definitely consider putting a version number for your schema at the
 beginning of the message, as Martin suggested. Then you can look up which
 schema each version number represents with something like the versioned
 schema repo in AVRO-1124 (https://issues.apache.org/jira/browse/AVRO-1124).
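
 As a concrete sketch of that framing (the 4-byte version and the repo
 lookup below are placeholders to adapt, not the AVRO-1124 API itself):

 import java.nio.ByteBuffer;

 public class VersionFraming {
   // Producer: prepend a 4-byte schema version to the Avro-encoded bytes.
   public static byte[] frame(int schemaVersion, byte[] avroBytes) {
     return ByteBuffer.allocate(4 + avroBytes.length)
         .putInt(schemaVersion)
         .put(avroBytes)
         .array();
   }

   // Consumer: read the version back, then resolve it against your schema
   // repo before decoding the remaining bytes.
   public static int readVersion(byte[] framed) {
     return ByteBuffer.wrap(framed).getInt();
   }
 }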

 --
 Felix


 On Tue, Jun 4, 2013 at 11:10 PM, Mark static.void@gmail.com wrote:

 I have a question. Say I want to use Avro as my serialization format to
 speak between service applications. Do I need to use Avro RPC for this, or
 can I just exchange Avro messages over HTTP?

 Also, what's the difference between an IPC client and an HTTP IPC client?
 https://github.com/apache/avro/tree/trunk/lang/ruby/test

 Thanks


 On May 29, 2013, at 8:02 PM, Mike Percy mpe...@apache.org wrote:

 There is no Ruby support for the Netty Avro RPC protocol that I know of.
 But I'm not sure why that matters, other than the fact that the Flume
 Thrift support is not in an official release yet.

 You could also take a look at the Flume HTTP source for a REST-based
 interface, but to accept binary data instead of JSON (the default) you
 would need to write a small bit of Java code and plug it in.
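
 Roughly, that bit of Java is an HTTPSourceHandler implementation. A sketch
 of a binary pass-through handler (untested; double-check the interface
 against the HTTPSource docs for your Flume version):

 import java.util.Collections;
 import java.util.List;
 import javax.servlet.http.HttpServletRequest;
 import org.apache.commons.io.IOUtils;
 import org.apache.flume.Context;
 import org.apache.flume.Event;
 import org.apache.flume.event.EventBuilder;
 import org.apache.flume.source.http.HTTPSourceHandler;

 public class BinaryBodyHandler implements HTTPSourceHandler {
   // Pass the raw (e.g. Avro-binary) request body through as one event.
   @Override
   public List<Event> getEvents(HttpServletRequest request) throws Exception {
     byte[] body = IOUtils.toByteArray(request.getInputStream());
     return Collections.singletonList(EventBuilder.withBody(body));
   }

   @Override
   public void configure(Context context) {
     // nothing to configure in this sketch
   }
 }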

 Make sure you differentiate between using Avro as a data storage format
 and as an RPC mechanism. They are two very different things and don't need
 to be tied together. Today, the data storage aspect is more mature and has
 much wider language support.

 Mike


 On Wed, May 29, 2013 at 9:30 AM, Mark static.void@gmail.com wrote:

 So basically Avro RPC is out of the question? Instead I would need to go
 Avro message -> Thrift -> Flume? Is that along the right lines or am I
 missing something?


 On May 28, 2013, at 5:02 PM, Mike Percy mpe...@apache.org wrote:

 Regarding Ruby support, we recently added support for Thrift RPC, so you
 can now send messages to Flume via Ruby and other non-JVM languages. We
 don't have out-of-the-box client APIs for those yet but would be happy to
 accept patches for it :)








Re: Is Avro right for me?

2013-05-28 Thread Mike Percy
The Flume project is actually working on what you might call first-class
Avro support right now, but you can already use Avro with Flume today, and
there are people doing so in production with success.

First of all, I assume that you want to store binary-encoded Avro in each
event. As mentioned previously in this thread, this implies that the schema
needs to come from somewhere. Right now, with the released version of Flume
(1.3.1), you would want to write your own EventSerializer
(http://flume.apache.org/FlumeUserGuide.html#event-serializers) for each
schema you need to write to HDFS. There is a base class
(http://flume.apache.org/releases/content/1.3.1/apidocs/org/apache/flume/serialization/AbstractAvroEventSerializer.html)
you can subclass that makes it easier to serialize Avro at that level.
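
For a concrete idea, a subclass might look roughly like this (a sketch
against the 1.3.1 base class linked above; the inline schema is a
placeholder for whichever schema you actually write):

import java.io.OutputStream;
import java.nio.ByteBuffer;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.serialization.AbstractAvroEventSerializer;
import org.apache.flume.serialization.EventSerializer;

public class MyEventSerializer extends AbstractAvroEventSerializer<GenericRecord> {
  // Placeholder schema: a record with a single bytes field.
  private static final Schema SCHEMA = new Schema.Parser().parse(
      "{\"type\":\"record\",\"name\":\"MyEvent\",\"fields\":"
      + "[{\"name\":\"body\",\"type\":\"bytes\"}]}");

  private final OutputStream out;

  private MyEventSerializer(OutputStream out) { this.out = out; }

  @Override protected OutputStream getOutputStream() { return out; }

  @Override protected Schema getSchema() { return SCHEMA; }

  // Map the Flume event onto a record of the schema above.
  @Override protected GenericRecord convert(Event event) {
    GenericRecord record = new GenericData.Record(SCHEMA);
    record.put("body", ByteBuffer.wrap(event.getBody()));
    return record;
  }

  // Flume instantiates serializers through a Builder named in the config.
  public static class Builder implements EventSerializer.Builder {
    @Override public EventSerializer build(Context context, OutputStream out) {
      MyEventSerializer serializer = new MyEventSerializer(out);
      serializer.configure(context);
      return serializer;
    }
  }
}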

There is a bunch of new development underway to make this a lot easier to
deal with.

1. Something to parse Avro container files and send them to Flume:
https://issues.apache.org/jira/browse/FLUME-2048
2. A generic event serializer that keys off a hash in the event header to
determine the schema: https://issues.apache.org/jira/browse/FLUME-2010
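
The producer side of (2) would then just stamp the schema hash into the
event headers, along these lines (the header key is a guess on my part,
since FLUME-2010 isn't finalized; check the patch for the real one):

import java.util.Collections;
import org.apache.flume.Event;
import org.apache.flume.event.EventBuilder;

public class SchemaHashStamp {
  // Attach the schema's hash so the serializer can resolve it downstream.
  public static Event stamp(byte[] avroBytes, String schemaHashHex) {
    return EventBuilder.withBody(avroBytes,
        Collections.singletonMap("flume.avro.schema.hash", schemaHashHex));
  }
}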

Regarding Ruby support, we recently added support for Thrift RPC, so you
can now send messages to Flume via Ruby and other non-JVM languages. We
don't have out-of-the-box client APIs for those yet but would be happy to
accept patches for it :)

Feel free to reach out to d...@flume.apache.org or u...@flume.apache.org if
you'd like more information or want to help get these features finalized
sooner!

Mike



On Tue, May 28, 2013 at 3:38 PM, Mark static.void@gmail.com wrote:

 Thanks for all of the information.

 I actually looked into Kafka quite some time ago, and I think we passed on
 it because it didn't have much Ruby support (that may have changed by now).


 On May 27, 2013, at 12:34 PM, Martin Kleppmann mar...@rapportive.com
 wrote:

 On 27 May 2013 20:00, Stefan Krawczyk ste...@nextdoor.com wrote:

 So it's up to you what you stick into the body of that Avro event. It
 could just be json, or it could be your own serialized Avro event - and as
 far as I understand serialized Avro always has the schema with it (right?).


 In an Avro data file, yes, because you just need to specify the schema
 once, followed by (say) a million records that all use the same schema. And
 in an RPC context, you can negotiate the schema once per connection. But
 when using a message broker, you're serializing individual records and
 don't have an end-to-end connection with the consumer, so you'd need to
 include the schema with every single message.
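
 (That once-per-file property is visible in the Java API; a sketch with
 DataFileWriter:)

 import java.io.File;
 import org.apache.avro.Schema;
 import org.apache.avro.file.DataFileWriter;
 import org.apache.avro.generic.GenericDatumWriter;
 import org.apache.avro.generic.GenericRecord;

 public class ContainerFileExample {
   public static void write(Schema schema, Iterable<GenericRecord> records,
       File file) throws Exception {
     DataFileWriter<GenericRecord> writer = new DataFileWriter<GenericRecord>(
         new GenericDatumWriter<GenericRecord>(schema));
     writer.create(schema, file); // the schema is written once, up front
     for (GenericRecord record : records) {
       writer.append(record);     // each record shares that one schema
     }
     writer.close();
   }
 }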

 It probably doesn't make sense to include the full schema with every one,
 as a typical schema might be 2 kB whereas a serialized record might be less
 than 100 bytes (numbers obviously vary wildly by application), so the
 schema size would dominate. Hence my suggestion of including a schema
 version number or hash with every message.
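
 If you go the hash route, Avro 1.7's SchemaNormalization can compute a
 canonical fingerprint for you; a sketch:

 import org.apache.avro.Schema;
 import org.apache.avro.SchemaNormalization;

 public class SchemaFingerprint {
   // A stable 64-bit fingerprint of the schema's canonical form, suitable
   // as the per-message schema tag.
   public static long of(Schema schema) {
     return SchemaNormalization.parsingFingerprint64(schema);
   }
 }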

 Be aware that Flume doesn't have great support for languages outside of
 the JVM.


 The same caveat unfortunately applies with Kafka too. There are clients
 for non-JVM languages, but they lack important features, so I would
 recommend using the official JVM client (if your application is non-JVM you
 could simply pipe your application's stdout into the Kafka producer, or
 vice versa on the consumer side).

 Martin





Re: Is Avro right for me?

2013-05-27 Thread Russell Jurney
What's more, there are examples and support for Kafka, but not so much for
Flume.


On Mon, May 27, 2013 at 6:25 AM, Martin Kleppmann mar...@rapportive.com wrote:

 I don't have experience with Flume, so I can't comment on that. At
 LinkedIn we ship logs around by sending Avro-encoded messages to Kafka (
 http://kafka.apache.org/). Kafka is nice, it scales very well and gives a
 great deal of flexibility — logs can be consumed by any number of
 independent consumers, consumers can catch up on a backlog if they're
 disconnected for a while, and it comes with Hadoop import out of the box.

 (RabbitMQ is designed more for use cases where each message corresponds to
 a task that needs to be performed by a worker. IMHO Kafka is a better fit
 for logs, which are more stream-like.)

 With any message broker, you'll need to somehow tag each message with the
 schema that was used to encode it. You could include the full schema with
 every message, but unless you have very large messages, that would be a
 huge overhead. Better to give each version of your schema a sequential
 version number, or hash the schema, and include the version number/hash in
 each message. You can then keep a repository of schemas for resolving those
 version numbers or hashes – simply in files that you distribute to all
 producers/consumers, or in a simple REST service like
 https://issues.apache.org/jira/browse/AVRO-1124
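
 On the consumer side, that lookup-then-decode step might look like the
 sketch below (the repo URL layout is made up; the actual REST paths in
 AVRO-1124 may differ):

 import java.io.InputStream;
 import java.net.URL;
 import java.util.Scanner;
 import org.apache.avro.Schema;
 import org.apache.avro.generic.GenericDatumReader;
 import org.apache.avro.generic.GenericRecord;
 import org.apache.avro.io.DecoderFactory;

 public class RepoDecode {
   public static GenericRecord decode(String topic, long version, byte[] avroBytes)
       throws Exception {
     // Fetch the writer's schema from the repo (URL layout is hypothetical).
     URL url = new URL("http://schema-repo.example.com/" + topic + "/" + version);
     InputStream in = url.openStream();
     String schemaJson = new Scanner(in, "UTF-8").useDelimiter("\\A").next();
     in.close();
     Schema writerSchema = new Schema.Parser().parse(schemaJson);
     // Decode the message bytes with the schema they were written with.
     return new GenericDatumReader<GenericRecord>(writerSchema)
         .read(null, DecoderFactory.get().binaryDecoder(avroBytes, null));
   }
 }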

 Hope that helps,
 Martin


 On 26 May 2013 17:39, Mark static.void@gmail.com wrote:

 Yes our central server would be Hadoop.

 Exactly how would this work with Flume? Would I write Avro to a file
 source which Flume would then ship over to one of our collectors, or is
 there a better/native way? Would I have to include the schema in each
 event? FYI we would be doing this primarily from a Rails application.

 Does anyone ever use Avro with a message bus like RabbitMQ?

 On May 23, 2013, at 9:16 PM, Sean Busbey bus...@cloudera.com wrote:

 Yep. Avro would be great at that (provided your central consumer is Avro
 friendly, like a Hadoop system).  Make sure that all of your schemas have
 default values defined for fields so that schema evolution will be easier
 in the future.


 On Thu, May 23, 2013 at 4:29 PM, Mark static.void@gmail.com wrote:

 We're thinking about generating logs and events with Avro and shipping
 them to a central collector service via Flume. Is this a valid use case?




 --
 Sean






-- 
Russell Jurney twitter.com/rjurney russell.jur...@gmail.com datasyndrome.com


Re: Is Avro right for me?

2013-05-27 Thread Stefan Krawczyk
Mark:
FWIW: I would go with Kafka if you can; it's far more flexible. We aren't
using it until it can authenticate producers and consumers and provide a
way to encrypt transport - we run in the cloud...

Anyway, so we're using Flume. With the current out-of-the-box
implementation, Flume encapsulates data in an Avro event itself.

So it's up to you what you stick into the body of that Avro event. It could
just be json, or it could be your own serialized Avro event - and as far as
I understand serialized Avro always has the schema with it (right?).

Be aware that Flume doesn't have great support for languages outside of the
JVM. Flume's Avro source that you can communicate with via Avro RPC uses
NettyServer/NettyTransceiver underneath, and as far as I know, there have
been no updates to other Avro RPC libraries (e.g. Python, Ruby) that enable
talking to such an Avro RPC endpoint. So you either have to build a client
that speaks that protocol, or create your own source.

Cheers,

Stefan







On Mon, May 27, 2013 at 11:08 AM, Russell Jurney
russell.jur...@gmail.com wrote:

 What's more, there are examples and support for Kafka, but not so much for
 Flume.


 On Mon, May 27, 2013 at 6:25 AM, Martin Kleppmann 
 mar...@rapportive.com wrote:

 I don't have experience with Flume, so I can't comment on that. At
 LinkedIn we ship logs around by sending Avro-encoded messages to Kafka (
 http://kafka.apache.org/). Kafka is nice, it scales very well and gives
 a great deal of flexibility — logs can be consumed by any number of
 independent consumers, consumers can catch up on a backlog if they're
 disconnected for a while, and it comes with Hadoop import out of the box.

 (RabbitMQ is designed more for use cases where each message corresponds to
 a task that needs to be performed by a worker. IMHO Kafka is a better fit
 for logs, which are more stream-like.)

 With any message broker, you'll need to somehow tag each message with the
 schema that was used to encode it. You could include the full schema with
 every message, but unless you have very large messages, that would be a
 huge overhead. Better to give each version of your schema a sequential
 version number, or hash the schema, and include the version number/hash in
 each message. You can then keep a repository of schemas for resolving those
 version numbers or hashes – simply in files that you distribute to all
 producers/consumers, or in a simple REST service like
 https://issues.apache.org/jira/browse/AVRO-1124

 Hope that helps,
 Martin


 On 26 May 2013 17:39, Mark static.void@gmail.com wrote:

 Yes our central server would be Hadoop.

 Exactly how would this work with Flume? Would I write Avro to a file
 source which Flume would then ship over to one of our collectors, or is
 there a better/native way? Would I have to include the schema in each
 event? FYI we would be doing this primarily from a Rails application.

 Does anyone ever use Avro with a message bus like RabbitMQ?

 On May 23, 2013, at 9:16 PM, Sean Busbey bus...@cloudera.com wrote:

 Yep. Avro would be great at that (provided your central consumer is Avro
 friendly, like a Hadoop system).  Make sure that all of your schemas have
 default values defined for fields so that schema evolution will be easier
 in the future.


 On Thu, May 23, 2013 at 4:29 PM, Mark static.void@gmail.com wrote:

 We're thinking about generating logs and events with Avro and shipping
 them to a central collector service via Flume. Is this a valid use case?




 --
 Sean






 --
 Russell Jurney twitter.com/rjurney russell.jur...@gmail.com datasyndrome.com



Re: Is Avro right for me?

2013-05-26 Thread Mark
Yes our central server would be Hadoop. 

Exactly how would this work with Flume? Would I write Avro to a file source
which Flume would then ship over to one of our collectors, or is there a
better/native way? Would I have to include the schema in each event? FYI we
would be doing this primarily from a Rails application.

Does anyone ever use Avro with a message bus like RabbitMQ? 

On May 23, 2013, at 9:16 PM, Sean Busbey bus...@cloudera.com wrote:

 Yep. Avro would be great at that (provided your central consumer is Avro 
 friendly, like a Hadoop system).  Make sure that all of your schemas have 
 default values defined for fields so that schema evolution will be easier in 
 the future.
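
 For instance, a field added later with a default stays readable on both
 sides of the upgrade; a sketch with made-up names:

 {
   "type": "record",
   "name": "LogEvent",
   "fields": [
     { "name": "message", "type": "string", "default": "" },
     { "name": "level", "type": "string", "default": "INFO" }
   ]
 }

 An old reader that doesn't know about "level" simply ignores it, and a new
 reader fills in "INFO" when reading data written before the field existed.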
 
 
 On Thu, May 23, 2013 at 4:29 PM, Mark static.void@gmail.com wrote:
 We're thinking about generating logs and events with Avro and shipping them 
 to a central collector service via Flume. Is this a valid use case?
 
 
 
 
 -- 
 Sean