On 08/12/2015 06:09 PM, Bryan Bende wrote:
All,
Given how popular Avro has become, I'm very interested in making progress
on providing first-class support within NiFi. I took a stab at filling in
some of the requirements on the Feature Proposal Wiki page [1] and wanted
to get feedback from everyone to see if these ideas are headed in the right
direction.
Are there any major features missing from that list? Any other
recommendations?
I'm also proposing that we create a new Avro bundle to capture the
functionality that is decided upon, and we can consider whether any of the
existing Avro-specific functionality in the Kite bundle could eventually
move to the Avro bundle. If anyone feels strongly about this, or has an
alternative recommendation, let us know.
[1]
https://cwiki.apache.org/confluence/display/NIFI/First-class+Avro+Support
Thanks,
Bryan
Thanks for putting this together, Bryan!
I have a few thoughts and observations about the proposal:
* Conversion to Avro is an easier problem than conversion from Avro.
Item #2 is to convert from Avro to other formats like CSV, but that
isn't possible for some Avro schemas. For example, Avro supports nested
lists and maps that have no good representation in CSV so we'll have to
be careful about that conversion. It is possible for a lot of data and
is definitely valuable, though.
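To make the problem concrete, here is a small stdlib-only sketch (plain dicts stand in for deserialized Avro generic records; the field names are hypothetical) of why a record with a nested array or map has no natural flat CSV row:

```python
import csv
import io

# A record as deserialized from an Avro schema with nested types.
record = {
    "id": 1,
    "tags": ["a", "b"],                      # Avro array: no flat CSV column
    "attrs": {"color": "red", "size": "L"},  # Avro map: same problem
}

def to_csv_row(rec):
    """Naive record->CSV conversion; nested values are rejected."""
    row = {}
    for key, value in rec.items():
        if isinstance(value, (list, dict)):
            raise ValueError(f"field {key!r} is nested; no CSV representation")
        row[key] = value
    return row

out = io.StringIO()
try:
    writer = csv.DictWriter(out, fieldnames=record.keys())
    writer.writeheader()
    writer.writerow(to_csv_row(record))
except ValueError as e:
    print(e)  # field 'tags' is nested; no CSV representation
```

A converter could instead flatten (e.g. `attrs.color`) or serialize nested values as JSON strings, but either choice is lossy or schema-dependent, which is why the conversion needs care.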
* For #3, converting Avro records, I'd also like to see the addition of
transformation expressions. For example, I might have a timestamp in
seconds that I need to convert to the Avro timestamp-millis type by
multiplying the value by 1000.
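A minimal sketch of what such a transformation expression could look like (the `convert` helper and field names are hypothetical, not an existing NiFi API):

```python
# Per-field transform: a source field holding epoch seconds is promoted
# to Avro's timestamp-millis logical type (a long counting milliseconds).

def seconds_to_millis(value):
    """Transform expression: epoch seconds -> epoch milliseconds."""
    return int(value * 1000)

# A record-level converter could apply a map of field name -> expression:
transforms = {"event_time": seconds_to_millis}

def convert(record, transforms):
    return {k: transforms.get(k, lambda v: v)(v) for k, v in record.items()}

print(convert({"id": 7, "event_time": 1439424569}, transforms))
# {'id': 7, 'event_time': 1439424569000}
```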
* There are a few systems like Flume that use Avro serialization for
individual records, without the Avro file container. This complicates
behavior a bit. Your suggestion to have merge/split is great, but we
should plan on having a couple of scenarios for it:
- Merge/split between files and bare records with schema header
- Merge/split Avro files to produce different sized files
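The two cases can be told apart cheaply: per the Avro spec, an object container file starts with the 4-byte magic "Obj" followed by 0x01, while bare serialized records carry no marker at all, so their schema must arrive out of band (e.g. a header or flowfile attribute). A sketch:

```python
# Distinguish an Avro container file from a bare serialized record.
# Container files begin with the 4-byte magic b"Obj\x01" (Avro spec);
# bare records have no marker, so a schema must be supplied separately.

AVRO_MAGIC = b"Obj\x01"

def is_container_file(payload: bytes) -> bool:
    return payload[:4] == AVRO_MAGIC

print(is_container_file(b"Obj\x01...header and blocks..."))  # True
print(is_container_file(b"\x02\x0eraw datum bytes"))         # False
```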
* The "extract fingerprint" processor could be more general and populate
a few fields from the Avro header:
- Schema definition (full, not fp)
- Schema fingerprint
- Schema root record name (if schema is a record)
- Key/value metadata, like compression codec
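For the fingerprint field, Avro defines a 64-bit Rabin fingerprint; the sketch below transliterates the reference implementation from the Avro spec into Python. Real use would fingerprint the schema's Parsing Canonical Form; here any byte string stands in:

```python
# 64-bit Rabin schema fingerprint, transliterated from the reference
# implementation in the Avro specification.

EMPTY = 0xC15D213AA4D7A795  # seed defined by the spec
MASK = 0xFFFFFFFFFFFFFFFF

_table = []
for i in range(256):
    fp = i
    for _ in range(8):
        fp = (fp >> 1) ^ (EMPTY if fp & 1 else 0)
    _table.append(fp)

def rabin_fingerprint64(data: bytes) -> int:
    """Fingerprint of a schema's canonical-form bytes."""
    fp = EMPTY
    for b in data:
        fp = ((fp >> 1) ^ _table[(fp ^ b) & 0xFF]) & MASK
    return fp
```

The full schema definition and root record name come straight out of the parsed header schema, and the key/value metadata (including `avro.codec`) is already in the file header's metadata map.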
* It looks like #7, evaluate paths, and #8, update records, are intended
for the case where the content is a bare Avro record. I'm not sure that
evaluating paths would work for Avro files.
* The update records processor is really similar to the processor to
convert between Avro schemas, #3. I suggest merging the two
and making it easy to work with either a file or a record via
record-level callback. This would be useful elsewhere as well. Maybe
tell the difference between file and record by checking for the filename
attribute?
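A rough sketch of that shape (the dispatch flag and callback signature are hypothetical; lists and dicts stand in for real Avro deserialization of files and records):

```python
# One record-level callback serving both cases: the processor decides
# whether the payload is a container file (many records) or a bare
# record, then applies the same callback per record.

def process(payload, is_file, callback):
    """Apply callback per record; payload is a list of records for a
    file, or a single record dict for a bare record."""
    if is_file:
        return [callback(r) for r in payload]
    return callback(payload)

uppercase_name = lambda r: {**r, "name": r["name"].upper()}
print(process({"name": "nifi"}, False, uppercase_name))
# {'name': 'NIFI'}
print(process([{"name": "a"}, {"name": "b"}], True, uppercase_name))
# [{'name': 'A'}, {'name': 'B'}]
```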
On the subject of where these processors go, I'm not attached to them
being in the Kite bundle. It would probably be better to separate that
out. However, there are some specific features in the Kite bundle that I
think are really valuable:
- Use a schema file from a HDFS path (requires Hadoop config)
- Use the current schema of a dataset/table
Those make it possible to update a table schema, then have that change
propagate to the conversion in NiFi. So if I start receiving a new field
in my JSON data, I just update a table definition and then the processor
picks up the change either automatically or with a restart.
The other complication is that the libraries for reading JSON and CSV
(and from an InputFormat if you are interested) are in Kite, so you'll
have a Kite dependency either way. We can look at separating the support
into stand-alone Kite modules or moving it into the upstream Avro project.
Overall, this looks like a great addition!
rb
--
Ryan Blue
Software Engineer
Cloudera, Inc.