On 08/12/2015 06:09 PM, Bryan Bende wrote:
All,
Given how popular Avro has become, I'm very interested in making progress
on providing first-class support within NiFi. I took a stab at filling in
some of the requirements on the Feature Proposal Wiki page [1] and wanted
to get feedback from everyone to see if these ideas are headed in the right
direction.
Are there any major features missing from that list? Any other
recommendations?
I'm also proposing that we create a new Avro bundle to capture the
functionality that is decided upon, and we can consider whether any of the
existing Avro-specific functionality in the Kite bundle could eventually
move to the Avro bundle. If anyone feels strongly about this, or has an
alternative recommendation, let us know.
[1]
https://cwiki.apache.org/confluence/display/NIFI/First-class+Avro+Support
Thanks,
Bryan
Thanks for putting this together, Bryan!
I have a few thoughts and observations about the proposal:
* Conversion to Avro is an easier problem than conversion from Avro.
Item #2 is to convert from Avro to other formats like CSV, but that
isn't possible for some Avro schemas. For example, Avro supports nested
lists and maps that have no good representation in CSV so we'll have to
be careful about that conversion. It is possible for a lot of data and
is definitely valuable, though.
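To make the problem concrete, here is a small stdlib-only sketch (plain dicts stand in for deserialized Avro generic records; the field names are hypothetical) of why a record with a nested array or map has no natural flat CSV row:

```python
import csv
import io

# A record as deserialized from an Avro schema with nested types.
record = {
    "id": 1,
    "tags": ["a", "b"],                      # Avro array: no flat CSV column
    "attrs": {"color": "red", "size": "L"},  # Avro map: same problem
}

def to_csv_row(rec):
    """Naive record->CSV conversion; nested values are rejected."""
    row = {}
    for key, value in rec.items():
        if isinstance(value, (list, dict)):
            raise ValueError(f"field {key!r} is nested; no CSV representation")
        row[key] = value
    return row

out = io.StringIO()
try:
    writer = csv.DictWriter(out, fieldnames=record.keys())
    writer.writeheader()
    writer.writerow(to_csv_row(record))
except ValueError as e:
    print(e)  # field 'tags' is nested; no CSV representation
```

A converter could instead flatten (e.g. `attrs.color`) or serialize nested values as JSON strings, but either choice is lossy or schema-dependent, which is why the conversion needs care.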
* For #3, converting Avro records, I'd also like to see the addition of
transformation expressions. For example, I might have a timestamp in
seconds that I need to convert to the Avro timestamp-millis type by
multiplying the value by 1000.
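A minimal sketch of what such a transformation expression could look like (the `convert` helper and field names are hypothetical, not an existing NiFi API):

```python
# Per-field transform: a source field holding epoch seconds is promoted
# to Avro's timestamp-millis logical type (a long counting milliseconds).

def seconds_to_millis(value):
    """Transform expression: epoch seconds -> epoch milliseconds."""
    return int(value * 1000)

# A record-level converter could apply a map of field name -> expression:
transforms = {"event_time": seconds_to_millis}

def convert(record, transforms):
    return {k: transforms.get(k, lambda v: v)(v) for k, v in record.items()}

print(convert({"id": 7, "event_time": 1439424569}, transforms))
# {'id': 7, 'event_time': 1439424569000}
```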
* There are a few systems like Flume that use Avro serialization for
individual records, without the Avro file container. This complicates
behavior a bit. Your suggestion to have merge/split is great, but we
should plan on having a couple of scenarios for it:
- Merge/split between files and bare records with schema header
- Merge/split Avro files to produce different sized files
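The two cases can be told apart cheaply: per the Avro spec, an object container file starts with the 4-byte magic "Obj" followed by 0x01, while bare serialized records carry no marker at all, so their schema must arrive out of band (e.g. a header or flowfile attribute). A sketch:

```python
# Distinguish an Avro container file from a bare serialized record.
# Container files begin with the 4-byte magic b"Obj\x01" (Avro spec);
# bare records have no marker, so a schema must be supplied separately.

AVRO_MAGIC = b"Obj\x01"

def is_container_file(payload: bytes) -> bool:
    return payload[:4] == AVRO_MAGIC

print(is_container_file(b"Obj\x01...header and blocks..."))  # True
print(is_container_file(b"\x02\x0eraw datum bytes"))         # False
```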
* The "extract fingerprint" processor could be more general and populate
a few fields from the Avro header:
- Schema definition (full, not fp)
- Schema fingerprint
- Schema root record name (if schema is a record)
- Key/value metadata, like compression codec
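For the fingerprint field, Avro defines a 64-bit Rabin fingerprint; the sketch below transliterates the reference implementation from the Avro spec into Python. Real use would fingerprint the schema's Parsing Canonical Form; here any byte string stands in:

```python
# 64-bit Rabin schema fingerprint, transliterated from the reference
# implementation in the Avro specification.

EMPTY = 0xC15D213AA4D7A795  # seed defined by the spec
MASK = 0xFFFFFFFFFFFFFFFF

_table = []
for i in range(256):
    fp = i
    for _ in range(8):
        fp = (fp >> 1) ^ (EMPTY if fp & 1 else 0)
    _table.append(fp)

def rabin_fingerprint64(data: bytes) -> int:
    """Fingerprint of a schema's canonical-form bytes."""
    fp = EMPTY
    for b in data:
        fp = ((fp >> 1) ^ _table[(fp ^ b) & 0xFF]) & MASK
    return fp
```

The full schema definition and root record name come straight out of the parsed header schema, and the key/value metadata (including `avro.codec`) is already in the file header's metadata map.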
* It looks like #7, evaluate paths, and #8, update records, are intended
for the case where the content is a bare Avro record. I'm not sure that
evaluating paths would work for Avro files.
* The update records processor is really similar to the processor to
convert between Avro schemas, #3. I suggest merging the two
and making it easy to work with either a file or a record via
record-level callback. This would be useful elsewhere as well. Maybe
tell the difference between file and record by checking for the filename
attribute?
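A rough sketch of that shape (the dispatch flag and callback signature are hypothetical; lists and dicts stand in for real Avro deserialization of files and records):

```python
# One record-level callback serving both cases: the processor decides
# whether the payload is a container file (many records) or a bare
# record, then applies the same callback per record.

def process(payload, is_file, callback):
    """Apply callback per record; payload is a list of records for a
    file, or a single record dict for a bare record."""
    if is_file:
        return [callback(r) for r in payload]
    return callback(payload)

uppercase_name = lambda r: {**r, "name": r["name"].upper()}
print(process({"name": "nifi"}, False, uppercase_name))
# {'name': 'NIFI'}
print(process([{"name": "a"}, {"name": "b"}], True, uppercase_name))
# [{'name': 'A'}, {'name': 'B'}]
```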
On the subject of where these processors go, I'm not attached to them
being in the Kite bundle. It would probably be better to separate that
out. However, there are some specific features in the Kite bundle that I
think are really valuable:
- Use a schema file from a HDFS path (requires Hadoop config)
- Use the current schema of a dataset/table
Those make it possible to update a table schema, then have that change
propagate to the conversion in NiFi. So if I start receiving a new field
in my JSON data, I just update a table definition and then the processor
picks up the change either automatically or with a restart.
The other complication is that the libraries for reading JSON and CSV
(and from an InputFormat if you are interested) are in Kite, so you'll
have a Kite dependency either way. We can look at separating the support
into stand-alone Kite modules or moving it into the upstream Avro project.
Overall, this looks like a great addition!
rb
--
Ryan Blue
Software Engineer
Cloudera, Inc.