Hello all,

I am new to the NiFi community but I have a good amount of experience with
ETL tools and applications that process lots of tabular data. In my
experience, JSON is only useful as the common format for tabular data if it
has a "flat" schema, in which case there aren't any advantages for JSON over
other formats such as CSV. However, I've seen lots of "CSV" files that don't
seem to adhere to any standard, so I would presume NiFi would need to
enforce a rigid standard such as RFC 4180
(http://www.rfc-base.org/txt/rfc-4180.txt).
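
To make the point concrete: under RFC 4180, a field containing a comma or
a quote must be enclosed in double quotes, with embedded quotes doubled.
A record like this (invented data, obviously) is unambiguous:

    id,comment
    1,"Smith, Jane said ""hello"""

Ad-hoc CSV writers tend to get exactly these cases wrong, which is why a
rigid standard matters.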

However, CSV isn't a natural way to express the schema of the rows, so JSON
or YAML is probably a better choice. There's a format called Tabular Data
Package that combines CSV and JSON for tabular data serialization:
http://dataprotocols.org/tabular-data-package/
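
As a rough sketch (the file and field names here are invented), the
descriptor is a datapackage.json that points at a plain CSV file and
carries the schema that the CSV itself cannot express:

    {
      "name": "example-tabular-package",
      "resources": [
        {
          "path": "data.csv",
          "schema": {
            "fields": [
              { "name": "id",   "type": "integer" },
              { "name": "name", "type": "string" }
            ]
          }
        }
      ]
    }

The rows stay in plain CSV; the JSON carries only the metadata.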

Avro is similar, but the schema must always accompany the data. In
the case of NiFi DataFlows, it's likely more efficient to send the schema
once as an initialization packet (I can't remember the real term in NiFi)
and then stream the rows individually, in batches of user-defined size,
sampled, etc.
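
Here's a minimal Java sketch of that idea using Avro's generic API (the
schema fields and the method around them are invented for illustration;
the Avro calls are the standard generic-datum API). The schema is parsed
once up front, and each row is then encoded as raw binary with no per-row
schema overhead:

    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericDatumWriter;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.avro.io.BinaryEncoder;
    import org.apache.avro.io.EncoderFactory;

    // Encode one row as raw Avro binary. The schema is NOT repeated
    // per row, so it only has to be shipped once at the start.
    byte[] encodeRow(Schema schema, long id, String name) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
        GenericDatumWriter<GenericRecord> writer = new GenericDatumWriter<>(schema);

        GenericRecord row = new GenericData.Record(schema);
        row.put("id", id);
        row.put("name", name);
        writer.write(row, encoder);
        encoder.flush();
        return out.toByteArray();
    }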

Having said all that, there are projects like Apache Drill that can handle
non-flat JSON files and still present them in tabular format. They have
functions like KVGEN and FLATTEN to transform the document(s) into tabular
format. In the use cases you present below, you already know the data is
tabular, so that extra data-model transformation is not needed. If JSON is
chosen as the common format, it should be apparent that a streaming JSON
processor would be necessary; otherwise, for large tabular datasets you'd
have to read the whole JSON file into memory just to parse individual rows.
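
For example, with a streaming parser such as Jackson's (a sketch; the
file argument and the assumption of flat, scalar-valued rows are mine),
rows can be consumed one at a time:

    import java.io.File;
    import java.io.IOException;
    import com.fasterxml.jackson.core.JsonFactory;
    import com.fasterxml.jackson.core.JsonParser;
    import com.fasterxml.jackson.core.JsonToken;

    void streamRows(File jsonFile) throws IOException {
        JsonFactory factory = new JsonFactory();
        try (JsonParser parser = factory.createParser(jsonFile)) {
            // Expect a top-level array of row objects.
            if (parser.nextToken() != JsonToken.START_ARRAY) {
                throw new IOException("expected a JSON array of rows");
            }
            // Each iteration consumes exactly one row object.
            while (parser.nextToken() == JsonToken.START_OBJECT) {
                while (parser.nextToken() != JsonToken.END_OBJECT) {
                    String field = parser.getCurrentName();
                    parser.nextToken();          // advance to the value
                    String value = parser.getText(); // assumes flat scalars
                    // handle one column of the current row here
                }
            }
        }
    }

Memory use stays proportional to a single row rather than the whole file.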

Regards,
Matt

From:  Toivo Adams <toivo.ad...@gmail.com>
Reply-To:  <dev@nifi.apache.org>
Date:  Monday, November 2, 2015 at 5:12 AM
To:  <dev@nifi.apache.org>
Subject:  Common data exchange formats and tabular data

All,
Some processors get/put data in tabular form (PutSQL, ExecuteSQL, and soon
Cassandra).
It would be very nice to be able to use such processors in a pipeline, where
the previous processor's output is the next processor's input. To achieve
this, the processors should use a common data exchange format.

JSON is the most widely used; it's simple and readable. But JSON lacks a
schema, and a schema can be very useful for automating data inserts/updates.

Avro has a schema, but it is somewhat more complicated and not widely used
(yet?).

Please see also:

https://issues.apache.org/jira/browse/NIFI-978

https://issues.apache.org/jira/browse/NIFI-901

Opinions?

Thanks
Toivo



