On 08/12/2015 06:09 PM, Bryan Bende wrote:
All,

Given how popular Avro has become, I'm very interested in making progress
on providing first-class support within NiFi. I took a stab at filling in
some of the requirements on the Feature Proposal Wiki page [1] and wanted
to get feedback from everyone to see if these ideas are headed in the right
direction.

Are there any major features missing from that list? Any other
recommendations?

I'm also proposing that we create a new Avro bundle to capture the
functionality that is decided upon, and we can consider whether any of the
existing Avro-specific functionality in the Kite bundle could eventually
move to the Avro bundle. If anyone feels strongly about this, or has an
alternative recommendation, let us know.

[1]
https://cwiki.apache.org/confluence/display/NIFI/First-class+Avro+Support

Thanks,

Bryan

Thanks for putting this together, Bryan!

I have a few thoughts and observations about the proposal:

* Conversion to Avro is an easier problem than conversion from Avro. Item #2 is to convert from Avro to other formats like CSV, but that isn't possible for some Avro schemas. For example, Avro supports nested lists and maps that have no good representation in CSV, so we'll have to be careful about that conversion. It is possible for a lot of data and is definitely valuable, though.
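
To make the problem case concrete, here's a rough sketch (the record and field names are made up) of a schema that is easy to produce from JSON but has no flat CSV representation:

    import org.apache.avro.Schema;
    import org.apache.avro.SchemaBuilder;

    // A record with an array field and a map field; neither maps cleanly to CSV columns.
    Schema nested = SchemaBuilder.record("Event").fields()
        .requiredString("id")
        .name("tags").type().array().items().stringType().noDefault()
        .name("attributes").type().map().values().stringType().noDefault()
        .endRecord();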

* For #3, converting Avro records, I'd also like to see the addition of transformation expressions. For example, I might have a timestamp in seconds that I need to convert to the Avro timestamp-millis type by multiplying the value by 1000.
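
A rough sketch of the kind of transform I mean, assuming a record-level hook in the convert processor (the field name is hypothetical):

    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericRecord;

    // Copy a record into the target schema, promoting a seconds value to timestamp-millis.
    GenericRecord copyWithMillis(GenericRecord in, Schema targetSchema) {
      GenericRecord out = new GenericData.Record(targetSchema);
      long seconds = (Long) in.get("event_time");   // seconds since the epoch in the source
      out.put("event_time", seconds * 1000L);       // target field is typed as timestamp-millis
      return out;
    }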

* There are a few systems like Flume that use Avro serialization for individual records, without the Avro file container. This complicates behavior a bit. Your suggestion to have merge/split is great, but we should plan for a couple of scenarios (both sketched after this list):
  - Merge/split between files and bare records with schema header
  - Merge/split Avro files to produce different sized files
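
For reference, the two read paths look roughly like this with the plain Avro API (method and variable names are placeholders):

    import java.io.InputStream;
    import org.apache.avro.Schema;
    import org.apache.avro.file.DataFileStream;
    import org.apache.avro.generic.GenericDatumReader;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.avro.io.BinaryDecoder;
    import org.apache.avro.io.DatumReader;
    import org.apache.avro.io.DecoderFactory;

    // Avro data file: the schema and codec travel in the container header.
    void readDataFile(InputStream in) throws Exception {
      try (DataFileStream<GenericRecord> reader =
               new DataFileStream<>(in, new GenericDatumReader<GenericRecord>())) {
        Schema schema = reader.getSchema();
        for (GenericRecord record : reader) {
          // merge/split operates on whole container files here
        }
      }
    }

    // Bare records: the schema has to come from somewhere else (an attribute,
    // a schema header we define, etc.) because there is no container.
    void readBareRecords(InputStream in, Schema schema) throws Exception {
      DatumReader<GenericRecord> datumReader = new GenericDatumReader<>(schema);
      BinaryDecoder decoder = DecoderFactory.get().binaryDecoder(in, null);
      GenericRecord record = null;
      while (!decoder.isEnd()) {
        record = datumReader.read(record, decoder);
      }
    }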

* The "extract fingerprint" processor could be more general and populate a few fields from the Avro header (a sketch follows the list):
  - Schema definition (full, not fp)
  - Schema fingerprint
  - Schema root record name (if schema is a record)
  - Key/value metadata, like compression codec
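
A sketch of pulling those out with the plain Avro API; the 64-bit parsing fingerprint here is just one choice, and we'd want to settle on the algorithm deliberately:

    import java.io.InputStream;
    import org.apache.avro.Schema;
    import org.apache.avro.SchemaNormalization;
    import org.apache.avro.file.DataFileStream;
    import org.apache.avro.generic.GenericDatumReader;
    import org.apache.avro.generic.GenericRecord;

    void extractHeaderAttributes(InputStream in) throws Exception {
      try (DataFileStream<GenericRecord> reader =
               new DataFileStream<>(in, new GenericDatumReader<GenericRecord>())) {
        Schema schema = reader.getSchema();
        String schemaJson = schema.toString();                       // full schema definition
        long fp = SchemaNormalization.parsingFingerprint64(schema);  // CRC-64-AVRO fingerprint
        String rootName = (schema.getType() == Schema.Type.RECORD)
            ? schema.getFullName() : null;                           // root record name
        String codec = reader.getMetaString("avro.codec");           // e.g. "null", "deflate", "snappy"
      }
    }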

* It looks like #7, evaluate paths, and #8, update records, are intended for the case where the content is a bare Avro record. I'm not sure that evaluating paths would work for Avro files.

* The update records processor is really similar to the processor that converts between Avro schemas, #3. I suggest merging the two and making it easy to work with either a file or a bare record via a record-level callback (sketched below). This would be useful elsewhere as well. Maybe tell the difference between file and record by checking for the filename attribute?
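
Roughly what I have in mind for the shared piece, sketched as a hypothetical callback (the interface name is made up):

    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericRecord;

    // A file-oriented invocation applies this to every record in the container;
    // a record-oriented invocation applies it to the single bare record.
    public interface RecordTransform {
      GenericRecord apply(GenericRecord input, Schema targetSchema);
    }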

On the subject of where these processors go, I'm not attached to them being in the Kite bundle. It would probably be better to separate that out. However, there are some specific features in the Kite bundle that I think are really valuable:
  - Use a schema file from an HDFS path (requires Hadoop config)
  - Use the current schema of a dataset/table

Those make it possible to update a table schema, then have that change propagate to the conversion in NiFi. So if I start receiving a new field in my JSON data, I just update a table definition and then the processor picks up the change either automatically or with a restart.
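
Loading a schema from an HDFS path is straightforward with the Hadoop FileSystem API plus the Avro schema parser; a sketch, with a made-up path:

    import org.apache.avro.Schema;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    Schema loadSchema(Configuration conf) throws Exception {
      Path schemaPath = new Path("hdfs://namenode/schemas/events.avsc");  // hypothetical location
      try (FSDataInputStream in = FileSystem.get(schemaPath.toUri(), conf).open(schemaPath)) {
        return new Schema.Parser().parse(in);
      }
    }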

The other complication is that the libraries for reading JSON and CSV (and from an InputFormat if you are interested) are in Kite, so you'll have a Kite dependency either way. We can look at separating the support into stand-alone Kite modules or moving it into the upstream Avro project.

Overall, this looks like a great addition!

rb


--
Ryan Blue
Software Engineer
Cloudera, Inc.
