RE: c++ DataFileWriter not doing validation

2013-05-27 Thread SCHENK, Jarrad


 From: SCHENK, Jarrad 
 Sent: Tuesday, 21 May 2013 12:52 PM
 To: user@avro.apache.org
 Subject: c++ DataFileWriter not doing validation 

 Hi List, 

 I'm working with the c++ bindings to try to write data to avro files. 

 Much of the documentation assumes that the types to be written (and the code 
 to write the data) are generated using avrogencpp.

 In my case I have an existing set of type/struct hierarchies that I'm trying 
 to write, so I don't want to use the output of avrogencpp directly. Instead 
 I am producing code that is very similar to what avrogencpp outputs but is 
 adapted to suit my types.
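 (To illustrate, this is the kind of code I'm writing by hand - a sketch only, 
 with MyRecord standing in for one of my existing structs, modelled on the 
 codec_traits specialisations that avrogencpp emits:)

 #include <stdint.h>
 #include <string>
 #include "avro/Specific.hh"
 #include "avro/Encoder.hh"
 #include "avro/Decoder.hh"

 // Hypothetical stand-in for one of my existing structs.
 struct MyRecord {
     std::string name;
     int32_t     count;
 };

 namespace avro {
 // Hand-written equivalent of what avrogencpp would generate.
 template<> struct codec_traits<MyRecord> {
     static void encode(Encoder& e, const MyRecord& v) {
         avro::encode(e, v.name);
         avro::encode(e, v.count);
     }
     static void decode(Decoder& d, MyRecord& v) {
         avro::decode(d, v.name);
         avro::decode(d, v.count);
     }
 };
 }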

 What I'm finding is that the c++ DataFileWriter does no validation of the 
 datums that get written against the schema that I provide. As such, any 
 discrepancy between the schema and the datums that are written causes the 
 file to be corrupted and essentially unreadable.

 I see that there is a ValidatingEncoder class that can be used when 
 serialising to a memory stream (as per the Getting Started docs) but there 
 doesn't appear to be any method for using this encoder with the 
 DataFileWriter.
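 (For reference, this is roughly the memory-stream approach from the Getting 
 Started docs that I mean - a sketch, assuming a schema file myrecord.json and 
 the MyRecord type/codec_traits from above:)

 #include <fstream>
 #include <memory>
 #include "avro/Compiler.hh"
 #include "avro/Encoder.hh"
 #include "avro/Specific.hh"
 #include "avro/Stream.hh"
 #include "avro/ValidSchema.hh"

 int main() {
     // Load and validate the schema.
     std::ifstream ifs("myrecord.json");
     avro::ValidSchema schema;
     avro::compileJsonSchema(ifs, schema);

     // Validating encoder wrapping a binary encoder, writing to memory.
     std::auto_ptr<avro::OutputStream> out = avro::memoryOutputStream();
     avro::EncoderPtr e = avro::validatingEncoder(schema, avro::binaryEncoder());
     e->init(*out);

     MyRecord r;
     r.name = "example";
     r.count = 42;
     avro::encode(*e, r);  // as I understand it, this throws if the encode
                           // calls don't match the schema
     return 0;
 }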

 Am I missing something? Is there a preferred way to make the writer do 
 validation?

 Thanks

 Jarrad

I never got any responses to this, so I assume either no one knows or no one 
actively uses the C++ bindings any more?

My basic question boils down to this: can the DataFileWriter be made to validate 
the datums it writes against the schema?

Any help much appreciated.

Jarrad




Re: Is Avro right for me?

2013-05-27 Thread Russell Jurney
What's more, there are examples and support for Kafka, but not so much for
Flume.


On Mon, May 27, 2013 at 6:25 AM, Martin Kleppmann mar...@rapportive.com wrote:

 I don't have experience with Flume, so I can't comment on that. At
 LinkedIn we ship logs around by sending Avro-encoded messages to Kafka (
 http://kafka.apache.org/). Kafka is nice, it scales very well and gives a
 great deal of flexibility — logs can be consumed by any number of
 independent consumers, consumers can catch up on a backlog if they're
 disconnected for a while, and it comes with Hadoop import out of the box.

 (RabbitMQ is designed more for use cases where each message corresponds to
 a task that needs to be performed by a worker. IMHO Kafka is a better fit
 for logs, which are more stream-like.)

 With any message broker, you'll need to somehow tag each message with the
 schema that was used to encode it. You could include the full schema with
 every message, but unless you have very large messages, that would be a
 huge overhead. Better to give each version of your schema a sequential
 version number, or hash the schema, and include the version number/hash in
 each message. You can then keep a repository of schemas for resolving those
 version numbers or hashes – simply in files that you distribute to all
 producers/consumers, or in a simple REST service like
 https://issues.apache.org/jira/browse/AVRO-1124
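 (As a rough illustration of the framing - nothing Kafka- or Avro-specific, and 
 the 4-byte big-endian id is just one arbitrary choice - each message could 
 carry the schema id in front of the Avro-encoded bytes:)

 #include <stdint.h>
 #include <vector>

 // Prepend a 4-byte big-endian schema id (a version number or a truncated
 // hash of the schema) to an already Avro-encoded payload.
 std::vector<uint8_t> frame(uint32_t schemaId,
                            const std::vector<uint8_t>& avroPayload) {
     std::vector<uint8_t> msg;
     msg.reserve(4 + avroPayload.size());
     msg.push_back((schemaId >> 24) & 0xFF);
     msg.push_back((schemaId >> 16) & 0xFF);
     msg.push_back((schemaId >> 8) & 0xFF);
     msg.push_back(schemaId & 0xFF);
     msg.insert(msg.end(), avroPayload.begin(), avroPayload.end());
     return msg;
 }

 // Consumer side: read the id, look the schema up in your repository, then
 // decode the remaining bytes with that schema.
 uint32_t schemaIdOf(const std::vector<uint8_t>& msg) {
     return (uint32_t(msg[0]) << 24) | (uint32_t(msg[1]) << 16)
          | (uint32_t(msg[2]) << 8)  |  uint32_t(msg[3]);
 }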

 Hope that helps,
 Martin


 On 26 May 2013 17:39, Mark static.void@gmail.com wrote:

 Yes our central server would be Hadoop.

 Exactly how would this work with Flume? Would I write Avro to a file
 source which Flume would then ship over to one of our collectors, or is
 there a better/native way? Would I have to include the schema in each
 event? FYI we would be doing this primarily from a Rails application.

 Does anyone ever use Avro with a message bus like RabbitMQ?

 On May 23, 2013, at 9:16 PM, Sean Busbey bus...@cloudera.com wrote:

 Yep. Avro would be great at that (provided your central consumer is Avro
 friendly, like a Hadoop system).  Make sure that all of your schemas have
 default values defined for fields so that schema evolution will be easier
 in the future.
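 (For example - purely an illustrative schema, not tied to your data - giving 
 every field a default lets readers on a newer schema fill in fields that older 
 writers never produced:)

 {
   "type": "record",
   "name": "LogEvent",
   "fields": [
     {"name": "message", "type": "string", "default": ""},
     {"name": "level",   "type": "string", "default": "INFO"},
     {"name": "host",    "type": ["null", "string"], "default": null}
   ]
 }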


 On Thu, May 23, 2013 at 4:29 PM, Mark static.void@gmail.com wrote:

 We're thinking about generating logs and events with Avro and shipping
 them to a central collector service via Flume. Is this a valid use case?




 --
 Sean






-- 
Russell Jurney twitter.com/rjurney russell.jur...@gmail.com datasyndrome.com


Re: Is Avro right for me?

2013-05-27 Thread Stefan Krawczyk
Mark:
FWIW, I would go with Kafka if you can; it's far more flexible. We aren't
using it until it authenticates producers and consumers and provides a way
to encrypt transport - we run in the cloud...

Anyway, so we're using Flume. In the current out-of-the-box
implementation, Flume encapsulates data in an Avro event itself.

So it's up to you what you stick into the body of that Avro event. It could
just be JSON, or it could be your own serialized Avro event - and as far as
I understand it, serialized Avro always has the schema with it (right?).

Be aware that Flume doesn't have great support for languages outside of the
JVM. Flume's Avro source that you can communicate with via Avro RPC uses
NettyServer/NettyTransceiver underneath, and as far as I know, there have been
no updates to the other Avro RPC libraries (e.g. Python, Ruby) to enable
talking to such an Avro RPC endpoint. So you either have to build a client
that speaks that protocol, or create your own source.

Cheers,

Stefan






